HOW TO MEASURE 
IN EDUCATION 







THE MACMILLAN COMPANY 

NEW YORK • BOSTON • CHICAGO • DALLAS 
ATLANTA • SAN FRANCISCO 

MACMILLAN & CO., Limited 

LONDON • BOMBAY • CALCUTTA 
MELBOURNE 

THE MACMILLAN CO. OF CANADA, Ltd.

TORONTO 



HOW TO MEASURE 
IN EDUCATION 



WILLIAM A. McCALL, Ph.D.

ASSISTANT PROFESSOR OF EDUCATION 
AT TEACHERS COLLEGE, COLUMBIA UNIVERSITY 



THE MACMILLAN COMPANY 

1922 

All rights reserved 



PRINTED IN THE UNITED STATES OF AMERICA 









Copyright, 1922, 
By the MACMILLAN COMPANY. 



Set up and electrotyped. Published, January, 1922. 






PREFACE 

The science of measurement is in its infancy and the art 
of measurement is younger still. Yet both have developed 
at such a phenomenal rate in the last few years as to make 
this movement for the mental measurement of children the 
most dramatic tendency in modern education. Educators 
returning from a few years' sojourn in foreign missionary 
fields testify that no recent change in educational practices 
compares with that effected by the growth of scientific edu- 
cational and intelligence testing. During the war, and since, 
it was my privilege to confer with educational or military 
commissions sent here by several foreign governments to 
study our schools. These representatives testified that the 
extensive use of scientific mental measurement was one of 
the most distinctive features of American education. This 
book describes rather fully the meaning and methods of this 
movement. 

The art of measurement is younger than the science of 
measurement because the abler workers have, of necessity, 
devoted their energies almost exclusively to the origination 
of foundational techniques. But the whole movement was 
so promising in the way of concrete assistance in meeting 
educational problems that practical educators have irre- 
sistibly demanded that the science of measurement be turned 
into the art of measurement almost overnight. Hence the 
last two or three years has witnessed a feverish effort to 
meet these demands. The result has been numerous mis- 
takes and remarkable successes. 

This book has a fourfold aim. First, it aims to present 
these successes and warn against a repetition of the mis- 
takes. 

Secondly, it aims to help forward the movement for making
teaching a genuine profession. Effective teaching requires
a skill and a command of refined procedure in excess
of that needed by the physician or surgeon. The medical
profession is a genuine profession just because its members 
are experts in refined procedure. This book is one of the 
two which I have planned to show that tangible, learnable,
refined techniques are possible in teaching and that even a
"born" teacher will not, in a few years, be able to compete
with a professionally trained teacher. 

In the third place this book aims to meet the needs of
the educators who are interested in both the How and the 
Why. Extremely elementary books in this field have done 
an admirable service in diffusing an interest in mental meas- 
urement. The more intelligent teachers are now asking, 
however, for a book which not only brings the art of meas- 
urement up to date, but which, in addition, goes sufficiently 
into the science of it that they may be able to use tests less 
blindly than heretofore. 

The final aim of this book is to bring together in one 
convenient volume most of the techniques needed by those 
engaged in mental measurement. At present the worker in 
this field must go to one book to learn how to construct a 
mental test, to another book to learn how to give and use 
the results of the test, to another book to learn how to 
apply statistical methods, and to still another book to dis- 
cover methods for graphic and tabular presentation. The 
expert will and should continue to consult these special 
treatises. Others do not like to give the necessary time. 
This book then, is really several small books in one. Fur- 
thermore, it has been so arranged that the reader can con- 
fine himself to the earlier and less technical portions or he 
can omit this and study the more technical chapters placed 
toward the end. 

As I think through the contents of this book and reflect 
upon the influences which have made it, I am keenly con- 
scious that it is both autobiography and biography. It is 
autobiography because several years of my own thought and 




labor and much experimentation in the world's most stimu- 
lating educational laboratory, Teachers College and New 
York City, have gone into it. 

This book is a biography because from cover to cover I 
see the tangible and relatively intangible evidences of my 
many teachers. My pupils have been my teachers. My 
contemporaries in mental measurement have been my teach- 
ers. In so far as my mention can accomplish it, I wish, 
however, to honor particularly five individuals. The first 
is, and has been, an elementary school teacher in the moun- 
tains of Kentucky and Tennessee. The second is President 
Emeritus of Cumberland College, Williamsburg, Ky. The 
third is President of Lincoln Memorial University, Cumber- 
land Gap, Tennessee. The influence of the fourth is re- 
sponsible for my interest in realms which transcend those 
represented in this book. The ideas of the fifth are so com- 
pletely a part of me that it might be said that he wrote the 
book through an imperfect medium. They are Jesse Wor- 
ley, E. E. Wood, George A. Hubbell, Frank M. McMurry, 
and Edward L. Thorndike. 

Wm. A. McCall



CONTENTS

PART ONE

How to Use Measurement

CHAPTER

I Place of Measurement in Education
II Measurement in Classifying Pupils
III Measurement in Diagnosis
IV Measurement in Teaching
V Measurement in Evaluating Efficiency of Instruction
VI Measurement in Vocational Guidance

PART TWO

How to Construct and Standardize Tests

VII Preparation and Validation of Test Material
VIII Organization of Test Material and Preparation of Instructions
IX Scaling the Test
X Scaling the Test: T Scale
XI Determination of Reliability, Objectivity, and Norms

PART THREE

Tabular, Graphic, and Statistical Methods

XII Tabular Methods
XIII Graphic Methods
XIV Statistical Methods: Mass Measures
XV Statistical Methods: Point Measures
XVI Statistical Methods: Variability Measures
XVII Statistical Methods: Relationship and Reliability Measures

APPENDIX

How to Secure Tests and Directions for Their Use

Index



LIST OF TABLES

TABLE

1. Retardation and acceleration in towns and cities
2. Classification of pupils on the basis of educational age and E. Q.
3. For converting pupils' composite scores on educational tests into educational ages
4. Reclassification and placement table
5. Distribution of changes made in reclassifying a school
6. Distribution of changes made in reclassifying another school
7. Effect of specially promoting pupils
8. Data needed to guide instruction in reading
9. For transmuting number of questions correct into T scores
10. Grade norms on the Thorndike-McCall Reading Scale
11. For transmuting T scores into reading ages
12. Sample tabulation form for Thorndike-McCall Reading Scale
13. Scores made on an informal silent reading test
14. Efficiency measurement with several standard tests
15. Reading efficiency of Baltimore white and colored schools
16. Interpretation of efficiency by means of grade unit and relative position
17. Construction of percentile table
18. Computation of percentile score
19. Computation of age score and E. Q.
20. For converting per cents correct into P. E. distances
21. Illustrating conversion of per cents correct into P. E. distances
22. Conversion of per cents of better judgment into P. E. differences in merit
23. For converting per cents into S. D. distances
24. Shows how to scale total scores
25. Age distributions for the Thorndike-McCall Reading Scale
26. Comparison of equal-step and unequal-step scales
27. Simple-total vs. cumulative-total method of combining units
28. Illustrates proper construction of a table
29. Illustrates an effective method of tabular presentation
30. Spelling scores made by 64 fourth grade pupils
31. Shows frequency distributions of the same data grouped in step intervals of 2 and step intervals of 4
32. Computation of mean for scores ungrouped and grouped in step intervals of 1
33. Computation of mean for scores ungrouped and grouped in step intervals of 10
34. Computation of Q1, median, and Q3 for scores ungrouped and grouped in step intervals of 1
35. Computation of Q1, median, and Q3 for scores ungrouped and grouped in step intervals of 10
36. Computation of Q1, median, and Q3 when step intervals are unequal
37. Computation of Q1, median, and Q3 when there are zeros in the frequency column
38. Computation of mean deviation for scores ungrouped and grouped in step intervals of 1
39. Computation of mean deviation for scores ungrouped and grouped in step intervals of 10
40. Computation of S. D. for scores ungrouped and grouped in step intervals of 1
41. Computation of S. D. for scores ungrouped and grouped in step intervals of 10
42. Computation of r
43. Computation of R
44. For transmuting R into r
45. How to interpret r
46. Summary of statistical results
47. Conversion of experimental coefficient into statement of chances
48. Statistical problems and answers



LIST OF DIAGRAMS

FIGURE

1. Comparison of E. Q.'s and I. Q.'s
2. Overlapping of educational ages for two adjoining grades
3. Estimation of true age means
4. How to fill out individual record card for one of Courtis' Standard Supervisory Tests
5-21. Illustrating standard methods for graphic presentation by means of bar and curve diagrams
22. A sector diagram
23. A sectioned-bar diagram
24. A combination bar-and-sectioned-bar diagram
25. Frequency surface showing identification of components
26. An approximately normal frequency surface
27. A normal frequency surface
28. Minus skewed frequency surface
29. Plus skewed frequency surface
30. Multi-modal frequency surface
31. A frequency surface with step intervals of 1
32. A frequency surface with step intervals of 2
33. Rectilinear relationship with an r of .303
34. Rectilinear relationship with an r of .8
35. Rectilinear and curvilinear relationships



PART ONE 

HOW TO USE MEASUREMENT 





CHAPTER I 
PLACE OF MEASUREMENT IN EDUCATION 

THESIS 1. "WHATEVER EXISTS AT ALL,
EXISTS IN SOME AMOUNT" ¹

It is possible to become so immersed in the details of the 
measurement of pupil achievement as to lose sight of its 
fundamental significance. It is this absence of perspective 
which is responsible for two educational afflictions: the lopsided
enthusiast for the scientific measurement of education,
and the equally unbalanced opponent of the movement.
Educational measurement has a sort of philosophy. I have 
attempted to condense the main elements of this philosophy 
into a series of theses which may help those who have not 
had much time to think along these lines to appreciate the 
true place which measurement should have in education. 
The first of these theses is stated above. 

Since all sane persons accept this thesis it needs no 
qualification, but a qualified thesis will suffice for our pur- 
pose, namely, whatever change the teacher makes in a pupil 
must be a change in an amount of something. We teachers 
will scarcely insist that our effort makes no change in 
amount. Even though such were the result of our effort 
it would not so much disprove the thesis but rather prove 
our own inefficiency. 

There is an ever-dwindling group who strenuously oppose 

¹ E. L. Thorndike, The Seventeenth Year Book of the National Society for the
Study of Education, Part II, p. 16; Public School Publishing Co., Bloomington, Ill.




the practical implications of the above thesis. They claim 
to be interested in the emancipation of education from 
the quantitative idea. Their effort is directed toward the 
qualitative in education. According to them there is in 
every person a non-quantitative quality, a

"... something far more deeply interfused, 
Whose dwelling is the light of setting suns." 

Did they truly "see into the life of things" they would 
realize that there is never a quantity which does not measure 
some quality, and never an existing quality that is non- 
quantitative. Even our halos vary in diameter. 

THESIS 2. ANYTHING THAT EXISTS IN 
AMOUNT CAN BE MEASURED 

At least half a dozen scales now exist by which it would 
have been possible to measure the quality of the Hand- 
writing on the Wall. Faust said: 

"What she reveals not to thy mental sight 
Thou wilt not wrest from her with levers and with screws." 

But science has enormously increased the subtlety of levers 
and screws, and our mental sight is obtuse compared to 
some of our present-day mental tests. Jesus tacitly ac- 
cepted and practiced mental measurement when He esti- 
mated the quantity of faith on a mustard-seed scale. 

It is possible to measure, at least crudely, an individual's 
love of a sunset or appreciation of opera. Theoretically 
the thesis is sound but whether practically we shall ever 
possess sufficient ingenuity to discover all the things that 
exist in amount and then measure them with any great 
accuracy, is a question. All that is necessary to accept for 
the present is that all the abilities and virtues for which 
education is consciously striving can be measured and be 
measured better than they ever have been. The measure- 
ment of initiative, judgment of relative values, leadership, 
appreciation of good literature and the like is entirely possible.
We already have a scientific scale for the measurement
of poetic appreciation. The measurements may not
be as exact as we might wish, but they would have value. 



THESIS 3. MEASUREMENT IN EDUCATION IS IN GEN- 
ERAL THE SAME AS MEASUREMENT IN THE PHYSI- 
CAL SCIENCES. 

The two types of measurement are fundamentally alike 
because both measure physical manifestation. Neither 
adding ability, nor good intentions can be measured by 
plunging a thermometer into a pupil's spiritual medium, but
they can be by measuring his behavior and judging his 
inner condition therefrom. Unless the witness is a habitual 
liar, psychologists can, with considerable success, determine 
by means of a breathing curve, when a witness is not telling 
the truth. 

In a still invisible future it may be possible to secure a
"movie" of a pupil's mental machinery when in operation
and thus secure the desired information but for the present 
it is necessary to measure the product produced and, if 
desired, infer the inner condition of the pupil. 

Measurement must frequently meet the objection of 
being too materialistic. Listen to Gilder in "The Poet's
Protest." 

"O man with your rule and measure, 
Your tests and analyses! 
You may take your empty pleasure, 
May kill the pine, if you please, 
You may count the rings and the seasons, 
May hold the sap to the sun. 
You may guess at the ways and the reasons 
Till your little day is done."

To parody Wagner in "The Better Way," one would think 
that it was the purpose to measure human worth by the ell, 
the value of a life by the number of its years, the painter's 
canvas by the yard, or the work of the poet by the pound 
or bushel. A student writes: "Measurement should not be 
applied where spiritual factors and ideal values are in- 
volved." Those educators who protest most violently 
against any such measurement of the pupil are daily probing 




his mental activity by methods which are comparable to 
the surgical operations of bygone ages. They find them- 
selves in a position of disapproving the lover who estimates 
his lady's affection by the radius of the pupil of her eye 
under standardized lighting, and of approving the scientific 
father who soothes the mother for his punishment of their 
infant by saying: "I am not slapping an innocent soul but 
spanking a physiological reaction." 

THESIS 4. ALL MEASUREMENTS IN THE PHYSICAL 
SCIENCES ARE NOT PERFECT 

Physical measurements are, in general, more exact than 
educational measurements but education has no monopoly 
upon imperfect tests. There are tests which are now the 
rule in physical sciences for which an expert in educational 
measurements would blush. The general superiority of 
physical measurements is not due to the fact that they are 
radically different in kind. Physical measurements are 
subject to practically all the errors which trouble educa- 
tional measurement. It is not that they do not exist in the 
former, but that they usually exist in such small amounts 
that the average person fails to see them. They are large 
enough to be the despair of experts in the various sciences. 
Thorndike ² has given us an excellent statement of this
point: 

"Nobody need be disturbed at these unfavorable contrasts
between measurements of educational products and meas- 
urements of mass, density, velocity, temperature, quantity 
of electricity, and the like. The zero of temperature was 
located only a few years ago, and the equality of the units 
of the temperature-scale rests upon rather intricate and 
subtle presuppositions. At least, I venture to assert that 
not one in four of, say, the judges of the Supreme Court, 
bishops of our churches, and governors of our states could 
tell clearly and adequately what these presuppositions are. 
Our measurements of educational products would not at 

² Op. cit., p. 18.




present be entirely safe grounds on which to extol or con- 
demn a system of teaching reading or arithmetic, but many 
of them are far superior to measurements whereby our 
courts of law decide that one trade-mark is an infringement 
on another." 

But the imperfections of educational measurements are, 
in general, far more glaring than the majority of those made 
in physics, chemistry and like sciences. Some may have 
gotten the impression from certain emotional and quixotic 
radicals that standard tests are perfect instruments. This 
is far from the truth. They have numerous and decided 
limitations. The recent somewhat unbalanced, but doubt- 
less necessary, propaganda is justified not because of the 
perfection of the tests recommended, but the much greater 
imperfection of the tests or lack of tests now in use. 

A common criticism of educational measurement is that 
the tests measure a narrow, limited segment of a pupil's
totality. Physical measurements tend to be more handicapped
in this respect than educational measurements.
Most of their measurements, such as measurements of 
length, width, weight, and temperature are exceedingly nar- 
row abstractions and they are exceedingly useful too. A 
totality test for a pupil would certainly be useful but if we 
possessed one we would proceed immediately to construct 
tests for the detailed measurement of pupil abilities. Scales 
for the measurement of composition are useful, but scales 
for the measurement of the elements which go to make up 
composition are also useful. Teachers not only teach children
"all over"; they teach them in detail. If tests are to
aid instruction effectively, there is as much need for them 
to measure in detail as in totality. 

THESIS 5. MEASUREMENT IS INDISPENSABLE TO 
THE GROWTH OF SCIENTIFIC EDUCATION 

Exact measurement has made possible the rapid progress 
in the natural sciences. It has been stated that the amount 




of soap used is an index of the civilization of a country. 
The exactness of measurement is a good index of the status 
of a science. Consider where science would be without 
its meter, gram, ampere, volt, ohm, watt, henry and the like. 
More than anything else it has been the absence of exact 
measurement which has kept education from the rank of a 
science. This plea for the development of those instruments 
which will make possible the progress of education as a 
science is made with knowledge of a recent statement by a 
prominent educator: "I think it would be disastrous if
education were reduced to an exact science." 

Richards,³ in his presidential address before the American
Association for the Advancement of Science, said: "Plato
recognized, long ago, in an often-quoted epigram, that when 
weights and measures are left out, little remains of any art. 
Modern science echoes this dictum in its insistence on quan- 
titative data; science becomes more scientific as it becomes 
more exactly quantitative." 

In fact, measurement and education are like the twin girls 
whose hair the mother of many children braided together. 
Neither of the twins could move unless they both moved 
together. 

Foote gives the above quotation in a nutshell when he 
says: "The day of guesswork must give way to definite
facts supported by undebatable evidence." 

There are those who tremble lest the development of 
education as a science will squeeze out of life its emotions 
and delicate perceptions. As well fear that woman suffrage 
or the "female" in industry will destroy gallantry among
men. The roots of these fine things go too deep into 
human nature. In an especially unhappy mood Amiel 
writes: "Philosophy will clip an angel's wings," and again,
"Science is a lucid madness engaged in tabulating its own 
necessary hallucinations." The basic function of Science 
is to help us to attain our objectives in the quickest and 
most economical way, whether the objectives be material or 

³ "The Problem of Radioactive Lead"; Science, Jan. 3, 1919.




spiritual. Science is frequently looked upon as materialistic 
chiefly because only those persons who seek material objec- 
tives have had the good sense to secure the aid of Science. 
Haeckel, who has just drawn the line under his life's work, 
must have had in mind the unnecessary inefficiency of idealism
when he wrote: "No cosmic problem was ever solved
or even advanced by that cerebral function we call 
emotion." For centuries education has been like an emo- 
tional dog chasing a frantic tail. We have had a long line 
of great educational thinkers from Plato through Pestalozzi, 
and Froebel ... to Dewey and beyond. "The old order 
changeth, yielding place to new," but no one seems to 
know whether the old or the new is better. In fact, there is 
grave suspicion that we move in an orbit whose form is the 
circle. These educational leaders are not answering questions.
They are asking questions which do not occur to
others. They are proposing problems for experimentation. 
The final answer to every educational question, except one, 
must be left to the educational measurer and must await the 
development of education as a science. 

THESIS 6. MEASUREMENT IN EDUCATION IS 
BROADER THAN EDUCATIONAL TESTS 

This book is not entitled Educational Tests because 
there are other methods of measuring pedagogical products. 
As previously stated, some estimate the quality of instruc- 
tion by investigating the material equipment of libraries 
and laboratories and class rooms, or by the academic or 
professional training of the teacher. Others measure in- 
struction by observing the teacher's method and by forming 
an opinion on the basis of these observations. Others base 
their judgments upon detailed observations of the behavior 
of pupils. Still others test the pupils by means of examina- 
tions. This book attempts to discuss the basic principles 
of measurement which apply not only to educational tests 
but to any sort of educational measurement. Some of the 
above methods of measuring educational results we are 




likely to have with us for some time to come. In our zeal 
for improving tests proper, we should not neglect the refine- 
ment of these methods. The main emphasis should be, and 
in this book will be, upon tests, because they offer the best 
promise for exact measurement. For if we can trust the 
experience in other fields, measurement by means of some 
sort of instruments will gradually replace all other forms. 
Finally, the book is not entitled Educational Measure- 
ment because education is deeply concerned with measure- 
ments which are not exactly products of instruction but 
about which educators need to be critical. 

THESIS 7. THERE ARE OTHER THINGS IN EDUCATION 
BESIDES MEASUREMENT 

It will doubtless come as a surprise to the reader to hear 
one who is interested in the scientific measurement of 
education make such an admission. There are at least three 
other important factors in education, namely, pupil, methods 
and material, and goals. The teacher needs to know the 
psychology of the pupil, the proper goals to which the pupil 
is to be developed, and the methods and material which 
should be employed to develop the pupil from his initial 
ability to the desired goal. What does measurement have 
to do with this process? 

THESIS 8. TO THE EXTENT THAT THE PUPIL'S 
INITIAL ABILITIES OR CAPABILITIES ARE UN- 
MEASURABLE A KNOWLEDGE OF HIM IS IMPOS- 
SIBLE 

It has just been stated that a teacher needs the most inti- 
mate possible knowledge of a pupil in order to know what 
methods and materials to employ in order to help him most 
quickly to attain a desired goal. We partly know a pupil 
when we know the abilities and capabilities which he possesses.
To determine the mere existence of an ability
involves a crude measurement. But if we know no more
than this we cannot tell whether a pupil has these abilities 




in sufficient quantities to permit him to matriculate for a 
Ph.D. in the university or just enough to enter the kinder- 
garten. We must know not only what qualities exist, but 
also in what amount they exist, and the more exactly we 
know this amount the better. Measurement is essential to 
a practical knowledge of psychology. 

THESIS 9. "TO THE EXTENT THAT ANY GOAL OF
EDUCATION IS INTANGIBLE IT IS WORTHLESS" *

We want to be able to answer at least three things about 
any goal: (1) What is the worth of the goal? (2) What
is the location of the goal? (3) Is the pupil moving toward 
or from the goal? Measurement is necessary to answer 
each one of these absolutely vital questions. Suppose it 
be said that one goal of instruction is to produce in the 
pupil an ability to write. The worth of this goal depends 
upon an exact or crude measurement of how much penman- 
ship contributes to the efficiency of a number of other 
superior activities. The goal has advanced little beyond 
perfect intangibility until it is located. How much ability 
to write? What speed? What quality? Even the worth 
of the goal cannot be answered until this location is made, 
since the worth varies with the quantity. The very words 
how much imply and in fact require measurement. Finally, 
it is necessary to answer the question: Is the pupil moving 
toward or from the goal? Without measurement the ques- 
tion is unanswerable. 

THESIS 10. THE WORTH OF THE METHODS AND 
MATERIALS OF INSTRUCTION IS UNKNOWN UNTIL 
THEIR EFFECT IS MEASURED 

The purpose of certain methods and materials is to help 
the pupil grow toward a certain goal. Do the methods 
employed accomplish their purpose? We cannot tell with- 
out employing measurement. For aught we know, the 
methods may be actually vicious. They may be forming 

* I am indebted to F. M. McMurry for this thesis. 




habits which not only do not lead toward the goal, but which 
may be building up difficulties for another method by a 
subsequent teacher. It is equally true that the compara- 
tive worth of different methods and materials is unknown 
until their effect upon the pupil is measurable. This means 
that measurement is indispensable to the experimental 
selection of the most economical educational conditions. 

Thus, measurement is everywhere in education and in our 
daily lives. Measurement is no rare freak. It gets up 
with us in the morning and goes to bed with us at night. 
The mile stone, the hand of the watch, the humble cup in 
the kitchen, the lengthening shadows of the trees on the 
grass, the spacing of the year into seasons, all indicate how 
ubiquitous measurement is. And measurement is just as 
immanent in the whole educational process as in life in 
general. There are other things in education besides meas- 
urement but they have no value so long as they are dissoci- 
ated from it. 

THESIS 11. MEASUREMENT OF ACHIEVEMENT
SHOULD PRECEDE SUPERVISION OF TEACHING 
METHOD 

Education is now being measured in two ways. When a 
child, I watched two coal miners lift a derailed car. Their 
efforts illustrate these two methods of measurement. A 
lever and fulcrum were brought, but the lever broke. A 
stronger lever was secured, but the fulcrum was too far 
from the car. Finally the proper adjustments were made 
and the car was lifted. Whether or not the car was lifted 
could be determined in two ways: (1) by measuring the
length of the lever, the resistance of the fulcrum and the 
ground under the fulcrum, the weight of the men, the point 
of application of their weight, the distance of this point to 
the fulcrum, the distance from the fulcrum to the car, the 
weight of the car; or (2) simply by determining whether 
the car was actually lifted. 

It is a fair assumption that the crucial purpose of elementary
education is to make certain changes in children. To
this end we have surrounded them with levers and fulcra 
in the shape of books, pictures, maps, tools, playthings, 
pedagogical methods and with teachers who will utilize these 
instruments as leverages to produce the desired changes. 

Again it is a fair assumption that the schools should know 
whether their levers, fulcra, etc., are really producing the 
changes desired. As in the case of the derailed car, there 
are two methods of measuring these changes: (1) strength
of lever, length of leverage, etc., become the number and
nature of the books in the libraries, map facilities, blackboard
space, and such, and the weight of the men becomes
the number of diplomas possessed by the teacher or else 
the amount of her skill in making provision for motive, 
initiative and such on the part of her pupils. (2) Whether 
the car is actually lifted is comparable to measuring directly 
the changes in the pupils. 

Doubtless our relatively primitive ancestors held con- 
ferences to discuss the advisability of such and such arrange- 
ments of lever and fulcrum in lifting a weight. Of course 
such possible discussions never were and never could be 
settled until the crucial measurement, the direct measurement,
was made. It would be of inestimable value to know
whether the presence of certain books in the schoolroom, or 
the possession of a certain amount of professional training 
on the part of the teacher and the like are prerequisites of 
certain defined changes in pupils. Without such ancillary 
measurements by teachers and supervisors, the conditions 
for pupils' growth cannot be arranged in advance with cer- 
tainty. But we shall not arrive at such knowledge except 
through direct measurement. We certainly cannot claim to 
know the exact causal relation between defined changes in
pupils and most of the paraphernalia with which the pupil
is now surrounded. In spite of our ignorance of these causal 
relations, the chief method of supervision at present is to 
attempt to judge the presence or absence or amount of 
presence of these levers and fulcra. 




THESIS 12. MEASUREMENT IS NO RECENT 
EDUCATIONAL FAD 

Judging from the vituperation that has been heaped upon 
it, and the efforts that have been necessary to propagate it, 
one would think that scientific measurement was something 
absolutely novel. As a matter of fact, educators are, and 
have always been confirmed users of measurement — meas- 
urement of a kind. For several generations teachers have 
been employing tests which, to the uninitiated observer, 
would differ from standard tests in only one respect. The 
teacher's test is usually written on the blackboard while the 
standard test is usually printed on paper. Present a stand- 
ard test to a teacher or principal who never heard of one, 
and neither will recognize that it is possessed of any peculiar 
virtues or patent dangers. Ayres tells us: "If Dr. Rice is
to be called the inventor of educational measurement, Professor
E. L. Thorndike should be called the father of the
movement." And yet, if the great majority of us had 
thought of standard tests before Dr. Rice, or scaled tests 
before Dr. Thorndike, we probably should not have deemed 
the ideas worth enough to spend time upon or dangerous 
enough to frantically bury. 

The writer's experience with the critics of standard tests 
convinces him that these critics have but two important
objections: first, tests are not available for measuring all
the aims of instruction, and, second, tests are sometimes 
misused. The first objection calls for, not the disuse of 
tests, but greater zeal in the extension of tests. The second 
objection calls for zeal, not against tests, but against their 
misuse. The closest students of scientific measurement are 
rarely its opponents and at the same time they are its sever- 
est critics. They are the severest critics because their 
criticisms are pertinent and because they are aware of 
numerous defects invisible to the casual observer. 

Measurement in education did not suddenly leap into 
existence. It has had a gradual evolution, or rather it has 
been on a plateau for centuries. A student's theme informs
us that: "Educational measurement is ancient as a fact,
medieval as a process and modern as a science. Half of 
Solomon's proverbs are tests for wisdom." The Chinese 
had a far-flung system of testing which was a sort of begin- 
ning for the Hillegas Composition Scale. The Roman father 
considered his son's literary education finished when his 
son could read the Roman Law from the tablet in the public 
forum. Little progress was made beyond the conventional,
formal examination until 1894, when Rice conceived the idea of
a comparative test to be used in measuring the results of 
instruction in many schools. Out of the comparative test 
grew norms, for the use of a comparative test upon many 
schools yields norms. It was the genius of Thorndike that 
made possible the next advance. Utilizing the Cattell- 
Fullerton equal-distance theorem, he devised a scale unit 
for the measurement of educational achievement. This 
marks the beginning of scientific educational measurement. 
Stone's Arithmetic Tests worked out under the direction of 
Thorndike and published in 1908, represent a sort of 
transition from the Rice comparative tests to the Thorndike 
Handwriting Scale published in 1909. Subsequent students 
of Thorndike's have elaborated the statistical technique for 
the construction of educational scales. Hillegas, Bucking- 
ham, Trabue, and Woody constructed respectively the 
Composition Scale, Spelling Scale, Language Scale and 
Fundamentals of Arithmetic Scale. 

The movement for the scientific measurement of educa- 
tion has spread with great rapidity. Courtis has been par- 
ticularly successful in disseminating an interest in tests. 
Hence it is appropriate that he should have directed the 
testing in the first formal survey where tests were employed. 
The survey was the New York City Survey of 1911-12 and 
the tests used were the Courtis Arithmetic Tests. Since that 
time the movement has grown apace until now tests and 
scales are in daily use throughout this country and around 
the world. Every school survey relies upon tests as one 
of its chief instruments for evaluating the efficiency of the
schools being surveyed. The Gary Survey by the Rockefeller
Foundation spent over $10,000 upon tests of pupils.
Most of the universities and many of the colleges and 
normal schools give courses in educational measurement. 
Several universities have a bureau of research which 
cooperates with the school communities in its region in the 
measuring of education. There are now about twenty-five 
formal city bureaus of research and the number is increas- 
ing. Great foundations like the Rockefeller Foundation are 
forwarding this movement. All are familiar with the splendid
work of Ayres in connection with the Russell Sage Foundation.
Hundreds of isolated workers are adding impetus
to the movement through their earnest study of educational 
problems by means of scientific measurement. But perhaps 
the greatest single force for the advancement of scientific 
educational and psychological measurement will prove to be 
the war work of the Psychology Committee of the National 
Research Council. Yerkes,^5 chairman of this committee,
writes: 

"It is already evident that the contributions to methods 
of practical mental measurement made by this committee 
of the National Research Council, and by the psychological 
personnel of the army, are profoundly influencing not only 
psychologists, but educators, masters of industry and the 
experts in diverse professions. New points of view, interest 
and expectations abound. The service of psychological 
examining in the army has conspicuously advanced mental 
engineering, and has assured the immediate application of 
methods of mental rating to the problems of classification 
and assignment in our educational institutions and our in- 
dustries." 

In the words of an enthusiastic but beginning student, "the 
importance of this measuremental (!) movement has been 
realized in education." 

5 Robert M. Yerkes, "Report of the Psychology Committee of the National
Research Council"; The Psychological Review, March, 1919.




THESIS 13. TESTS WILL NOT MECHANIZE EDUCATION
OR EDUCATORS

There seems to be a feeling that tests favor the so-called 
mechanical or conservative rather than radical methods in 
education. When properly used, they favor neither one. 
Ultimately tests will be the judge to give an impartial 
decision as to which method is the more effective. Until 
scientific measurement is extended, however, no decision 
between the two methods can be reached, because present 
tests cannot measure some of the most important aims of 
both educational conservatives and radicals. Suffice it to 
state here that present standard tests when improperly used 
may easily cause a greater mechanization of education, but 
when properly used they may easily be the salvation of 
education from too great a mechanization. The defense of 
this statement will appear later. 

Much less is there any ground for believing that tests 
will squeeze the humanity out of teachers. The teacher 
should be the master of the instrument, not vice versa. It 
is to be hoped that there are no teachers like the farmer who 
was uncertain whether he was working to support ten cows
or they were working to support him. If there are such 
teachers, tests cannot injure them. They are beyond in- 
jury. It is not probable that because a teacher can measure 
a pupil's ability in handwriting her interest in the finer 
things of life will evaporate. There is food for reflection 
in the statement by Thorndike that it is not the mothers who 
weigh their babies least often who love them most. 

THESIS 14. TESTS WILL NOT PRODUCE A 
DEADLY UNIFORMITY 

Perhaps it would be more accurate to say: tests need 
not produce a deadly uniformity. Tests need not destroy 
individuality in pupils. There exists in the minds of some 
a fear that composition, handwriting and drawing scales 
placed in the schoolroom before the pupils or even used by
the teacher to measure pupil product, will tend to discourage
individuality. This fear is justified if there are, in a school, 
teachers who will instruct pupils to write their compositions 
just like the compositions on the scale, or to make their 
penmanship look just like the penmanship on the hand- 
writing scale. My faith in the common sense of the mem- 
bers of the teaching profession is so high that I do not hold 
it important to argue this question further. It must be 
evident to anyone that composition and handwriting scales 
are measuring instruments and not models to be imitated, 
except in so far as the specimens of increasing quality
on the scale are goals toward which to strive. In fact, when
such scales are properly used, they should increase indi- 
viduality. By placing before the pupils definite objectives, 
the scales tend to increase interest in attaining these objec- 
tives, and it is a truism of psychology that the most prolific 
source of varied products, and hence of originality and individuality,
is a powerful interest. Let the destiny of a nation depend
upon the development of anti-submarine devices, or the
favor of a king depend upon the production of an original 
literary masterpiece, or the issue of a baseball champion- 
ship contest depend upon the development of a new 
strategy, and the conditions are very favorable for indi- 
vidual initiative. Thus scales are favorable to individuality 
when they are used as measuring instruments and for the 
location of definite objectives. They were never intended 
for models to be imitated. The teacher should urge the 
pupil to write a composition which is of as high a quality 
as a certain scale specimen and not to write a composition 
which is just like that specimen. Trabue once remarked 
that it is possible to feed an infant according to its weight 
without feeding it with the scoop. 



CHAPTER II 
MEASUREMENT IN CLASSIFYING PUPILS 

I. Classification by Intelligence Tests 

Bases and Objectives of Classification. — There are 
three main types and several minor types of measurement 
which may be used as bases of classification. The three 
chief types are intelligence measurements, educational 
measurements, and pedagogical measurements or teachers' 
marks. Medical measurements are frequently used to 
classify together pupils who are anemic. Chronological 
measurements have been used at Fairhope to classify pupils 
into life groups. There will be considered here only the 
three main types as they are used to classify pupils, not by 
separate subjects but by a sort of average of all subjects. 
The technique of classifying by separate subjects will prove 
easy to the one who masters the technique to be described. 

The first fundamental objective of classification is to put 
together those of equal educational status. It is believed 
that homogeneous groups will make more satisfactory 
progress, due to the fact that the teacher can teach such 
a group almost as one pupil. The needs of all pupils are 
then closely similar. The work can be more exactly adapted 
to all. It saves the wear and tear on the teacher of con- 
tinually shifting adjustment from one grade of ability to 
another. Franzen has described the instruction of teachers 
in non-homogeneous groups thus, "they mystify the lower 
quarter and bore the upper quarter." 

The second fundamental objective of classification is to 
put together those who will progress at equal rate. At the 
best, periodic reclassification will be necessary. These will
need to be much more frequent if provision is made for
equal initial ability only and not for equal rate of progress. 
Provision should be made for both. To make perfect pro- 
vision for equal rate of progress would require a knowledge 
of each pupil's interest, industry, physiological limit, etc. 
Fortunately a simpler method is available which will prove 
sufficiently accurate for practical purposes. 

Classifying by single subjects rather than a cross section 
of all subjects does help some but not greatly. Any one 
subject in the elementary school is divided into a multitude 
of subordinate mental traits which may or may not be 
psychologically akin. There is as much difference between 
different parts of geography as between geography and his- 
tory. The correlation between ability in addition and ability 
in subtraction is not much closer than the correlation be- 
tween ability in addition and ability in grammar. 

The above are the legitimate objectives of classification 
together with the justifications for these objectives. But 
there are illegitimate objectives to be guarded against. It 
is difficult to improve upon Judd's ^1 summary of them.

"Sometimes the school allows a pupil to move up a grade 
or class, although it is known that he has not done the work 
below, because the parents of the child have influence and it 
does not seem safe to antagonize them. 

"Sometimes the pressure of numbers in the lower grades 
or classes is so great that the teacher sends a pupil on in 
order to make room for the younger pupils, even when it is 
evident that the pupil will not be able to carry the higher 
work. 

"Sometimes the teacher in a given grade is anxious to 
unload the backward or disorderly and therefore incompe- 
tent pupil on someone else, and since the open road is into 
the next higher grade, the child is sent on. 

"Promotion is sometimes controlled by the calendar. Because
the date for closing the schools has arrived, and the
long vacation is at hand, pupils are declared to have completed
the work whether they have or not.

"Sometimes it is more or less explicitly argued that the
backward pupil is larger than the other children of like
intellectual attainments and he should therefore be sent to
the upper-grade room where the seats are larger."

1 Charles H. Judd, Introduction to the Scientific Study of Education, pp. 109-110;
Ginn and Company.

Relation Between Mental Age and Quality of School 
Work. — The close relation between mental age and qual- 
ity of school work is now never questioned. When any 
pupil fails to make satisfactory progress in his school work, 
the first step toward finding an explanation is usually to get 
a measure of the pupil's mental age. There is a substantial 
correlation between a teacher's mental-age rating for her 
pupils and her marks upon their school work. This is to 
be expected, however, for the excellence of a pupil's school
work is the chief means the teacher has to estimate the 
pupil's mental age. But the same close relation is also shown 
by objective tests of intelligence. Terman reports a corre- 
lation of .725 between mental age and the quality of work 
in the first grade. McCall ^2 reports a correlation of .78 in
the sixth grade. Dickson ^3 studied five first-grade classes
and found that only 2 of the 33 retarded children had 
normal mentality. He concluded that, while there may be 
contributory causes, low mentality is undoubtedly the chief 
cause of the retardation of 31 of these 33 children. 

Not chronological age, physical size, and a variety of other 
criteria, but ability to do the work is the real criterion for 
classification. Terman, Dickson, Whipple, and others have 
shown that a pupil's mental age is an excellent index of the 
quality of work a pupil will be able to do. In his study at 
Urbana, Whipple shows that mental tests give a more 
accurate classification than teachers' judgments and school 
marks. It therefore seems an inevitable conclusion that
classification by mental age or its equivalent is superior to
classification by teachers' judgments. Hence, Terman suggests
that intelligence tests be given to all pupils, and that
each pupil be placed in a grade closely corresponding to his
mental age.

2 Wm. A. McCall, Correlation of Some Psychological and Educational Measurements;
Bureau of Publications, Teachers College, N. Y. C., 1916.

3 See Lewis M. Terman, The Intelligence of School Children, p. 64; Houghton
Mifflin Company, 1919.

Relation Between Mental Age and Present Grade 
Position. — The best estimate is that 25 per cent of the 
pupils in any grade belong mentally in a lower grade and 
25 per cent in a higher grade. In sum, there is both too 
much acceleration and too much retardation — too much 
acceleration of the stupid and too much retardation of the 
intelligent. Data collected by Terman ^4 show not only that
almost every grade contains pupils with mental ages rang- 
ing from eight to fourteen but also that errors in classifica- 
tion bear more heavily upon bright pupils than dull pupils. 

Still further evidence of the extent to which the bright 
pupil is penalized is found in Table 1. Strayer finds that
the total chronological over-ageness is 33.5%. The other 
two columns show that this per cent is not exaggerated. 
When this 33.5% over-ageness is contrasted with 4.3% 
under-ageness some conception may be gotten of the injus- 
tice being done to young pupils with high mental ages. Late 
entrance, absences, and the fact that the school is adjusted
to a child of higher than 100 I.Q. explain part of this difference
between over-ageness and under-ageness. Due to these
causes, the per cent of under-ageness should not be expected 
to equal the per cent of over-ageness, but the two per cents 
should approximate each other. There are as many pupils 
whose mental age is above normal as below normal. Dick- 
son has shown that the chief cause of over-ageness is a 
mental age below normal. Hence a mental age above nor- 
mal should mean a corresponding under-ageness. But no 
investigation has revealed anything like a corresponding 
under-ageness. 

4 Lewis M. Terman, The Intelligence of School Children, p. 26; Houghton
Mifflin Co., 1919.



TABLE 1

Retardation and Acceleration in Towns and Cities (Adapted from
Strayer, Morton, and Salt Lake City Survey Report).

Amount of Retardation      Strayer ^5    96 Cities and Towns   Salt Lake
or Acceleration            318 Cities    of Nebraska ^6        City

Over-age 1 year            19%           16.3%                 26.7%
Over-age 2 years           9.5%          7.6%                  11.2%
Over-age 3 years           3.8%          [illegible]           3.7%
Over-age 4 yrs. or more    1.3%          1.4%                  1.2%
Total over-age             33.5%         28.6%                 43%
Total under-age            4.3%







Since the process of classifying by mental age and I.Q. 
is similar to the process of classifying by educational age 
and E.Q. the former may be discussed with the latter. 



II. Classification by Educational Tests 

Measurement of a Sample School. — Recently I was 
asked to bring my class and measure on one day all the 
pupils in a small school. Educational tests were to be used. 

The first step in carrying out this project was to select 
the tests to be used. The tests selected were Thorndike's 
Reading Scale Alpha 2 both Parts I and II, Thorndike's
Visual Vocabulary Scale A-2x, Trabue-Kelley's Language 
Completion Scale, ten words from each of six different 
columns of Ayres' Spelling Scale, Woody's Addition, Sub- 
traction, Multiplication, and Division Scales, Series B, and 
Composition scored by the Nassau Extension of the Hillegas 
Scale. No test was used which could not be given to any 
pupil in the entire school, and which did not measure an 
important phase of the school's work. 

5 George D. Strayer, Age and Grade Census of Schools and Colleges, Bulletin
451, p. 144; U. S. Bureau of Education, 1911.

6 W. H. S. Morton, "Retardation in Nebraska"; Psychological Clinic, Dec., 1912,
and Jan. 19, 1913.

When tests are to be used for reclassifying a school the
beginner will do well to select tests according to the following
special principles:

1. The test should be uniform for all grades being re- 
classified. Some tests have one form for, say, grades III, 
IV, and V and another form for grades VI, VII, and VIII. 
Unless the scores are comparable from one form to the 
next, and they rarely are, such tests will cause difficulty. 

2. The test should yield a single score. Some tests yield 
both speed and accuracy scores. Such tests serve a useful 
diagnostic purpose, but the beginner is likely to experience 
difficulty in attempting to employ such tests for reclassifica- 
tion. 

3. The test should measure an important phase of the 
school's work. Unless the school is to be classified by sub- 
jects, the different tests should measure different subjects as 
a rule. 

The second step was to select and train examiners. They 
were trained by having them actually apply under obser- 
vation the particular test assigned to them. 
The third step was to apply the tests according to standard
procedure. To prevent a collision between examiners
the plan shown below was devised. 

As an examiner was detailed to a class the number of the 
period under the appropriate grade was circled. Just as 
soon as an examiner finished his test he returned to the 
central office, reported his completion and a cross was drawn 
inside the circle. Immediately afterward the examiner 
whose turn was next was sent to the unoccupied class. A 
chart like this shows the director at a glance what classes
are and are not occupied and just whose turn is next.
Observe that the testing periods are numbered from 1 to
7 in the vertical column under Grade III. This is because
there are more tests than classes. Had there been seven
classes and only six tests, the testing periods would have
been numbered from 1 to 7 horizontally opposite Test I.

In accordance with this plan every pupil in the school 
above Grade II was tested. 



                            III   IV    V     VI    VII   VIII

Reading         Test I      1     2     3     4     5     6
Completion      Test II     2     3     4     5     6     7
Add. and Sub.   Test III    3     4     5     6     7     1
Composition     Test IV     4     5     6     7     1     2
Mul. and Div.   Test V      5     6     7     1     2     3
Vocabulary      Test VI     6     7     1     2     3     4
Spelling        Test VII    7     1     2     3     4     5
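
The rotation in the plan above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the rule used (each test starts one period later than the test above it, wrapping around after period 7) is read off the chart rather than stated in the text.

```python
# Hypothetical helper: reproduces the rotating plan by shifting each
# test's starting period by one relative to the test listed above it.
TESTS = ["Reading", "Completion", "Add. and Sub.", "Composition",
         "Mul. and Div.", "Vocabulary", "Spelling"]
GRADES = ["III", "IV", "V", "VI", "VII", "VIII"]


def build_schedule(tests, grades):
    """Return {test: {grade: period}} with periods numbered from 1."""
    n_periods = max(len(tests), len(grades))
    return {
        test: {grade: (t + g) % n_periods + 1 for g, grade in enumerate(grades)}
        for t, test in enumerate(tests)
    }


def has_collision(schedule):
    """True if any examiner or any class is booked twice in one period."""
    for by_grade in schedule.values():
        periods = list(by_grade.values())
        if len(periods) != len(set(periods)):
            return True  # one examiner would be in two classes at once
    grades = list(next(iter(schedule.values())).keys())
    for grade in grades:
        periods = [by_grade[grade] for by_grade in schedule.values()]
        if len(periods) != len(set(periods)):
            return True  # two examiners would be in one class at once
    return False


schedule = build_schedule(TESTS, GRADES)
for test in TESTS:
    print(f"{test:<14}", *(schedule[test][g] for g in GRADES))
```

Because every row and every column of the chart contains each period at most once, no examiner is needed in two classes at once and no class receives two examiners at once.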

The fourth step was to score the tests and to compute 
pupil scores. 

The fifth step was to tabulate pupil scores. The scores 
are shown in Table 2. The detailed tabulation by test 
elements, which was sent to the teachers, is not shown. 

The sixth step was to compute the median (later chapter) 
score on each test for each grade. These are shown in 
Table 2 just above the grade medians for the preceding year. 

The seventh step was to tabulate norms for each test 
and grade. These are shown in Table 2. A few of the 
tests were standardized for mid-year. But our tests were 
given at the end of the year. Hence before any fair com- 
parisons could be made it was necessary to alter the norms 
to fit the end of the year. The following shows an approxi- 
mate method for converting the mid-year norms for a test 
into June norms. 

                 III    IV     V      VI     VII    VIII
Mid-year norm    10     14     18     21     23     24
June norm        12     16     19.5   22     23.5   24.5
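
The figures above are consistent with one simple rule, which the text does not state in words and which is therefore an inference: add to each mid-year norm half of the gain to the next grade's mid-year norm, and for the highest grade reuse the last observed gain. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def june_norms(mid_year):
    """Shift mid-year norms half a grade-step forward (assumed rule)."""
    # Gain from each grade's mid-year norm to the next grade's.
    gains = [nxt - cur for cur, nxt in zip(mid_year, mid_year[1:])]
    # Assumption: the top grade has no next grade, so repeat the last gain.
    gains.append(gains[-1])
    return [norm + gain / 2 for norm, gain in zip(mid_year, gains)]


mid_year = [10, 14, 18, 21, 23, 24]  # grades III to VIII, from the table
print(june_norms(mid_year))
```

The rule is approximate because it treats growth between two successive mid-year norms as uniform over the year.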

Computation of Composite Scores. — The next step 
was to compute a composite score for each pupil. As shown 
in Table 2, the first pupil, Ant., made the following scores 
on the nine tests: 3, 35, 58, 31, 11, 0, 3, 3, 2.8. A composite
of these scores could be made by the simple process
of summing them. The sum of these scores is 146.8. The
composite score as computed by us, however, is 91. To
sum scores just as they stand is to give the score on the
spelling test twice as much influence as the score on the
reading test, and the score on the vocabulary test thirty
times as much weight as the score on the composition test.
Competent judges are substantially agreed, however, that
reading is certainly not less important than spelling, and
vocabulary is not thirty times as important as composition.
It is a very common practice, however, to sum scores just
as they appear without giving any thought to this matter
of weighting.

[TABLE 2. Scores of each pupil on the nine tests, by grade, with composite scores, grade medians for 1919 and 1918, and norms. The tabulated figures are not recoverable.]

Tests should be weighted according to the variability of 
their scores. They should not, as is frequently supposed, be 
weighted according to the size of their scores. Which of
these two tests exercises the greater influence upon the composite?



Pupil      Test I      Test II      Composite

a             6          403           409
b            10          404           414
c             2          403           405
d             1          402           403



Test I has more weight than Test II because the varia- 
bility of the scores in Test I is greater than the variability 
of the scores in Test II. The scores in Test I range from 
1 to 10; the scores in Test II range only from 402 to 404,
or, what is equivalent, from 2 to 4. If we multiply all the 
scores of Test II by 5, the range will be increased, and 
will then range from 2010 to 2020, i. e., 10 points. If we 
divide the scores of Test I by 5, or what is equivalent multi- 
ply them by 1/5, there will be a corresponding shrinkage of 
their variability. 
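The point can be checked by a short computation. The following Python sketch is an illustration only: the variable names are assumptions, and the range is used as a convenient stand-in for the measure of variability. It takes the four pupils of the small table above and shows how far each test can move the composite.

```python
# Scores of pupils a-d, copied from the small table above.
test_1 = {"a": 6, "b": 10, "c": 2, "d": 1}
test_2 = {"a": 403, "b": 404, "c": 403, "d": 402}


def score_range(scores):
    """Highest score minus lowest score: a crude measure of variability."""
    values = list(scores.values())
    return max(values) - min(values)


# The composite is the plain sum of the two tests.
composite = {pupil: test_1[pupil] + test_2[pupil] for pupil in test_1}

range_1 = score_range(test_1)          # Test I can shift the composite by 9 points
range_2 = score_range(test_2)          # Test II can shift it by only 2 points
range_composite = score_range(composite)

# Multiplying every Test II score by 5 stretches its range to 10 points.
range_2_times_5 = score_range({p: 5 * s for p, s in test_2.items()})

print("Range of Test I:", range_1)
print("Range of Test II:", range_2)
print("Range of composite:", range_composite)
print("Range of Test II after multiplying by 5:", range_2_times_5)
```

Although every score on Test II is some forty times the size of the largest score on Test I, it is Test I that accounts for most of the spread of the composite.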

The actual process of computing the composite for the 
scores of Table 2 was as follows: 

1. The scores on the reading test of all the pupils in the 
entire school were thrown into a frequency distribution. 
(Later chapter.) Similar distributions were made for each 
of the other tests. 

2. The Q1, Q3, and Q were computed for each test.
(Later chapter.) They are shown at the end of the table.
The Q1 and Q3 serve no other purpose than to yield Q, which
is a measure of variability. The Q's show that, if the scores
were summed just as they stand, the vocabulary test with its
Q of 31.1 would have greatest weight; and composition with
its Q of 1.2 would have least weight.

3. The multipliers shown below the Q's were selected so 
as to re-weight the tests in rough accordance with my idea 
of how the tests should be weighted. The following shows 
the Q's, the multipliers and the new weight or new Q given. 

Tests          I     II    III     IV     V     VI    VII   VIII    IX

Q             6.9   10.6   31.1   15.7   2.5   3.1   3.7    2.6    1.2

Multiplier     1     1     1/5    1/3     1     1     1      1      5

New Q         6.9   10.6   6.2    5.2    2.5   3.1   3.7    2.6     6

The completion test or Test II was given most weight, 
not because it was considered more significant than the 
reading tests, but because it was a saving of labor to leave 
its Q unchanged. This completion test is a reliable and gen- 
erally excellent test. It is one of the best intelligence tests 
and it was desired that the composite be a good index of 
intelligence. It was preferred to run the risk of giving it 
too much weight than of giving it too little, especially when 
leaving the Q unchanged reduced labor. Reading and completion
could have been given identical weight by dividing
10.6 by 1.5. But guesses at the weighting which tests 
should receive will necessarily be so inaccurate that it is 
foolish to increase one's labor by using as multipliers or 
divisors other than whole numbers. Reading or Test I 
received the next largest weight. Vocabulary or Test III 
was given slightly less weight than reading, and this is prob- 
ably as it should be. Composition or Test IX was given 
about the same weight as reading and vocabulary. Spelling 
or Test IV, being only an element of composition, was given 
slightly less weight than Composition. The tests of arithme- 
tic fundamentals or Tests V, VI, VII, and VIII, were given 
individually the least weight of all. This was not because 
arithmetic is unimportant in school work, but because these
tests were four in number and even then measured only a 
small section of the work in arithmetic. Since there were
four tests, arithmetic actually received a total weight of
11.9 or (2.5 + 3.1 + 3.7 + 2.6) as against 23.7 or (6.9 + 
10.6 + 6.2) for the tests which may be classed as reading 
tests. These reading tests were given a greater combined 
weight than the arithmetic tests because reading is pre- 
requisite to more of the total work of the school than the 
fundamentals of arithmetic. As many of these questions of 
weighting as possible should be settled by the refined, yet 
laborious, technique of partial correlation and regression 
equations. 

4. All the pupil scores, grade scores, and norms for each 
test, were multiplied by the multipliers selected for that 
test. This means that nothing at all was done to the reading 
scores since their multiplier was 1. The same was true for
completion. The vocabulary scores were divided by 5; the 
spelling scores were divided by 3; the arithmetic scores 
remained unchanged; and the composition scores were mul- 
tiplied by 5. All these products and quotients do not appear 
in Table 2. In the original computation they were written 
between the columns of the table in red ink. 

5. The weighted scores were summed to get a composite 
score for each individual, a grade composite, and a norm 
composite. The following illustrates the fourth and fifth 
steps for pupil Ant. 

Test               I    II   III    IV    V    VI   VII  VIII   IX    Composite

Ant                3    35    58    31    11    0     3     3   2.8

Multiplier         1     1   1/5   1/3     1    1     1     1    5

Weighted Score     3    35    12    10    11    0     3     3    14       91
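Steps 3 to 5 can be retraced by machine. The sketch below is offered only as an illustration under stated assumptions: the names are invented for the purpose, exact fractions are used so that 1/5 and 1/3 introduce no rounding error, and weighted scores are rounded to the nearest whole number as the figures for pupil Ant suggest.

```python
from fractions import Fraction

TESTS = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]

# Q of each test and the multiplier chosen for it (from the table in step 3).
Q = [Fraction(s) for s in
     ["6.9", "10.6", "31.1", "15.7", "2.5", "3.1", "3.7", "2.6", "1.2"]]
MULTIPLIER = [Fraction(1), Fraction(1), Fraction(1, 5), Fraction(1, 3),
              Fraction(1), Fraction(1), Fraction(1), Fraction(1), Fraction(5)]

# Scores of pupil Ant on the nine tests (from the illustration above).
ANT = [Fraction(s) for s in ["3", "35", "58", "31", "11", "0", "3", "3", "2.8"]]

# Step 3: multiplying all scores of a test by a constant multiplies its Q
# by the same constant. New Q's are rounded to one decimal place.
new_q = [round(q * m, 1) for q, m in zip(Q, MULTIPLIER)]

# Combined weight of the reading group (Tests I-III) and of the four
# arithmetic tests (Tests V-VIII).
reading_weight = float(sum(new_q[0:3]))
arithmetic_weight = float(sum(new_q[4:8]))

# Steps 4 and 5: weight each score, round to a whole number, and sum.
weighted = [round(score * m) for score, m in zip(ANT, MULTIPLIER)]
composite = sum(weighted)

print("New Q:", [float(q) for q in new_q])
print("Reading group weight:", reading_weight)
print("Arithmetic group weight:", arithmetic_weight)
print("Ant weighted scores:", weighted)
print("Ant composite:", composite)
```

The output agrees with the New Q row of step 3, with the totals of 23.7 and 11.9, and with the composite of 91 shown for pupil Ant.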

Transmutation of Grade Norms Into Age Norms. — 
Table 2 shows the educational age and E.Q. not only for 
each pupil in School X but also for the medians and norms 
of each grade. How in detail were these computed? It 
was impossible to compute the educational age of a pupil 
on any test until age norms were determined for the tests. 
Unfortunately it is the custom to report grade norms but not 
age norms for educational tests. No age norms being available,
it became necessary to transmute grade norms into
age norms. The third-grade norm on the spelling test 
reported in Table 2 is shown to be 19.6. Suppose it is 
known that the average chronological age of all third-grade 
pupils is 9 years. Knowledge of this would permit the con- 
version of the grade norm into an age norm. It could be 
said that the norm in spelling for average nine-year-olds is 
19.6. In similar fashion all the grade norms could be con- 
verted into age norms. 

The above conclusion may not be exactly true. Just 
because the median age of all third-grade pupils is nine 
years and the median score in spelling is 19.6, it does not 
necessarily follow that if all the nine-year-old pupils scat- 
tered through several grades were tested their median score 
would be exactly 19.6. There is, however, every reason to 
believe that it would be approximately 19.6. This method 
of transmuting grade norms into age norms assumes that 
the two would be identical. The method is a temporary one. 
It should be discarded as soon as properly determined age 
norms are available. 

A rather thorough search did not reveal any widespread 
study which gives for the United States the average or 
median chronological age of the pupils in each grade. 
Enough data has been found, however, to permit the com- 
putation with a fair degree of accuracy of the average age 
of the pupils in each grade. Ayres ^ gives a frequency dis- 
tribution showing the age of entering the first grade of 
13,868 pupils who were about to graduate from the eighth 
grade in 29 cities. It has been computed from this table 
that the median age of entering first grade is 80 months. It 
is barely possible that pupils who remain to graduate from 
the elementary school tend to enter earlier or later than 
children in general. Any such difference if it exists at all 
is probably slight. Hence it is assumed that the median age 
at which pupils enter the first grade is 80 months. 

^ Leonard P. Ayres, The Relation Between Entering Age and Subsequent Progress
Among School Children, Bulletin No. 112; Russell Sage Foundation, N. Y. C.






At least three studies show how long it takes the average 
pupil to complete each grade. In the above study Ayres
found that the average time for the average pupil to com- 
plete each grade was 12.8 months (including vacation). 
Terman ^ found the average grade interval in terms of 
mental age to be 12.6 months. While acting as statistician 
for the Psychology Committee of the National Research 
Council in the preparation of the National Intelligence Tests, 
Kelley determined that the average time required for the 
average pupil to pass from one grade to the next was 13.2 
months. It has been concluded, therefore, that the average 
time required for the average pupil to pass from grade to 
grade is roughly 13 months. 

Having determined the age when the average pupil enters 
the first grade, and the average number of months required 
by him to pass from grade to grade it was possible to con- 
struct Table 3 for computing educational ages. The second 
and third columns of Table 3 may be used for any set of 
tests. The first column will depend upon the particular 
tests selected. 

TABLE 3

The Norm Composite for, and Average Age in May of Pupils
in Each School Grade. A Table for Converting Pupil Composites
Into Educational Ages.

Norm Composite      Average Age      Grade

    0 (est)              89           I
   35 (est)             102           II
   71                   115           III
  107                   128           IV
  137                   141           V
  165                   154           VI
  188                   167           VII
  199                   180           VIII
  216 (est)             193           IX
  230 (est)             206           X
  246 (est)             219           XI





^ Lewis M. Terman, The Intelligence of School Children, p. 94; Houghton 
Mifflin Co. 




The above table should be interpreted as follows: Reading from
right to left, pupils in Grade I have, in May, an average 
chronological age of (80 + 9) or 89 months, and the norm 
composite for Grade I is estimated to be zero. The average 
age for Grade II is (89 + 13) or 102 months with an esti- 
mated composite of 35. The average age for Grade III 
is (102 + 13) or 115 months with, not an estimated but, a 
known composite of 71. The average age of each succeed- 
ing grade has been determined by adding 13 months to the 
average age of the preceding grade. All the composites 
beyond the eighth grade are estimated. 

Grade I is given an average age of 89 months instead of 
80 months because the norms for all the tests shown in 
Table 2 are either May norms or have been computed for- 
ward to May. The average pupil in the third grade, say, 
was 115 months old when the norms for these tests were 
actually or arbitrarily determined. The interval between 
Grades VIII and IX (first year high school) is probably 
nearer 14 than 13 months, but in the absence of exact in- 
formation it was preferred to keep constant the increment 
of 13 months. 

Grade I was assigned a norm composite of zero because 
the average first grade pupil would probably make a zero 
score on these tests. It was estimated that the norm com- 
posite for Grade II was roughly half-way between zero and 
the norm composite for Grade III which is 71. The norm 
composites of 71, 107, etc., through 199 are the last numbers 
under each grade in the column headed Composite in Table 
2. Each of these numbers is the weighted composite of the 
norms for the grade in question for all the tests. Any com- 
mon sense method might be used to estimate the norm 
composites for grades beyond the eighth. The increases of
Grade VII over VI and of Grade VIII over VII were averaged,
and the resulting 17 was added to the norm composite of
Grade VIII, namely, 199, to give 216 for Grade IX. The
composite of Grade X was found by averaging the increases
of Grade IX over VIII and of Grade VIII over VII and adding
the resulting 14 to 216, and so on. It was necessary to thus
extend the table below
Grade III and above Grade VIII because there are pupils 
in Table 2 who have educational ages below Grade III and 
above Grade VIII. 
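The construction of Table 3 can be sketched in a few lines of Python. The sketch is a reconstruction under assumptions, not a record of the original computation: the names are hypothetical, the Grade II estimate is taken as half the Grade III norm with the fraction dropped, and averaged increases are rounded to the nearest whole number with halves rounded upward.

```python
GRADES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI"]

ENTRY_AGE = 80       # median age in months on entering Grade I
MONTHS_TO_MAY = 9    # months from entrance to the May testing date
GRADE_INTERVAL = 13  # months required to pass from grade to grade


def half_up(x):
    """Round a positive number to the nearest integer, halves upward (assumed)."""
    return int(x + 0.5)


# Average chronological age in May for each grade.
average_age = {grade: ENTRY_AGE + MONTHS_TO_MAY + GRADE_INTERVAL * i
               for i, grade in enumerate(GRADES)}

# Norm composites known from Table 2 (Grades III to VIII).
composite = {"III": 71, "IV": 107, "V": 137, "VI": 165, "VII": 188, "VIII": 199}

# Estimates below Grade III.
composite["I"] = 0
composite["II"] = composite["III"] // 2  # roughly half-way between 0 and 71

# Estimates above Grade VIII: add the average of the two preceding increases.
for i in range(GRADES.index("IX"), len(GRADES)):
    g, p1, p2, p3 = GRADES[i], GRADES[i - 1], GRADES[i - 2], GRADES[i - 3]
    last_increase = composite[p1] - composite[p2]
    earlier_increase = composite[p2] - composite[p3]
    composite[g] = composite[p1] + half_up((last_increase + earlier_increase) / 2)

for grade in GRADES:
    print(f"{composite[grade]:>4}  {average_age[grade]:>4}  {grade}")
```

Under these assumptions the sketch reproduces every row of Table 3, including the estimated composites of 216, 230, and 246.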

The above table was designed to convert each pupil's
composite score into an educational age. It can be used just 
as well to convert each pupil's score on each test separately 
into an educational age or subject age for that test. To do 
this it is necessary to substitute the norm score for each 
grade for the test in question in the place of the norm 
composite. If we were constructing an educational age 
table for the reading test in Table 2, for example, the norm 
scores of 8, 15, 20, 24, 28, and 30 would appear in the first 
column of Table 3 beside Grades III, IV, V, VI, VII, and 
VIII respectively. Appropriate norms could be estimated 
for lower and higher grades. In this way it would be pos- 
sible to compute a reading age and Reading Quotient for 
each pupil in reading and classify the school for reading 
only. By constructing, in similar fashion, as many tables 
as there are tests in Table 2 it would be possible to compute 
for each pupil as many educational ages and E.Q.'s as there 
are tests, and thus classify by subjects if this is desired. 

Again, a median of each pupil's educational ages and 
E.Q.'s on the nine tests would give a final composite measure 
of his educational age and E.Q. respectively. Or again it 
would be possible to determine the median of his nine 
educational ages and divide this final median once for all 
by his chronological age to get his final E.Q. If desired, the 
educational ages could be weighted according to the signifi- 
cance of the tests from which they were derived just as the 
original scores in Table 2 were weighted in computing the 
composite for each pupil. It was decided, instead, to com- 
pute each pupil's educational age and E.Q. just once and 
that through the composite of his original weighted scores. 

Computation of Educational Age and E.Q. — The 
actual process of computing educational age and E.Q. was
as follows: The first pupil in Table 2, namely, pupil Ant.,
has a composite score of 91. According to Table 3, had his
composite score been 71 he would be given an educational
age of 115, for he would have an educational status equal
to the average 115-months-old child. Had his composite
been 107 he would be entitled to an educational age of 128
months. Since his composite was 91 his educational age is
between 115 and 128 months. Interpolating, it is found to be

$$115 + \left[\frac{128 - 115}{107 - 71} \times (91 - 71)\right] = 115 + \left(\frac{13}{36} \times 20\right) = 122.$$

Thus pupil Ant. has an educational age of 122
months. His chronological age as shown in Table 2 is 109
months. Since

$$\text{E.Q.} = \frac{\text{educational age}}{\text{chronological age}},$$

pupil Ant. has an E.Q. of $\frac{122}{109}$ or 112 (that is, 1.12 written
without the decimal point). Both his educational age and E.Q. are
recorded in the last two columns but one of Table 2. Pupil
War J. has an educational age of 132 and an E.Q. of 90,
computed as follows:

$$\text{Ed. Age} = 128 + \left[\frac{141 - 128}{137 - 107} \times (116 - 107)\right] = 128 + \left(\frac{13}{30} \times 9\right) = 132 \text{ months.}$$

His E.Q., found by dividing 132 by his chronological age as shown
in Table 2, is 90. In this way an educational age and E.Q. were
computed for every pupil and for the norm for each grade.
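The interpolation just described may be written as a small routine. The following Python sketch is illustrative only; the function names are assumptions, and the table rows are those of Table 3.

```python
# (norm composite, average age in months) for Grades I to XI, from Table 3.
TABLE_3 = [(0, 89), (35, 102), (71, 115), (107, 128), (137, 141), (165, 154),
           (188, 167), (199, 180), (216, 193), (230, 206), (246, 219)]


def educational_age(score, table=TABLE_3):
    """Interpolate linearly between the two table rows that bracket the score."""
    for (low_comp, low_age), (high_comp, high_age) in zip(table, table[1:]):
        if low_comp <= score <= high_comp:
            months_per_point = (high_age - low_age) / (high_comp - low_comp)
            return low_age + months_per_point * (score - low_comp)
    raise ValueError("composite lies outside the range of the table")


def quotient(ed_age, chronological_age):
    """E.Q.: educational age divided by chronological age, times 100."""
    return 100 * ed_age / chronological_age


ant_age = round(educational_age(91))    # pupil Ant., composite 91
ant_eq = round(quotient(ant_age, 109))  # chronological age 109 months
war_age = round(educational_age(116))   # pupil War J., composite 116

print("Ant: educational age", ant_age, "E.Q.", ant_eq)
print("War J.: educational age", war_age)
```

The routine returns 122 months and an E.Q. of 112 for pupil Ant., and 132 months for pupil War J., in agreement with the computation above.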

The 1919 third-grade median educational age could be
computed in the same way, or could be found by taking the
median of the educational ages of the third-grade pupils.
The latter method was used and yielded a median of 111.
Either method gives approximately the same result. The
E.Q. for the 1919 medians could be computed either by
dividing 111 by the median chronological age of the pupils
or by taking the median of the E.Q.'s of the third-grade
pupils. For reasons which will not be discussed here the
two results will not be exactly identical. The E.Q. was
found by taking the median of the pupils' E.Q.'s. The
educational age and E.Q. for the norm must necessarily be
as shown in Table 2 because the computation of all pupil
educational ages and E.Q.'s assumes that the norm third-grade
educational age and E.Q. are 115 and 100 respectively.

In actually computing pupil educational ages a very much 
finer table than Table 3 was used. The working table 
showed the norm composite corresponding to every month. 
This saves the annoyance of interpolating, because with such 
a table no further calculation is necessary. Educational 
ages are read directly. Compare the following Grade III-IV 
portion of the working table, for example, with this same 
portion of Table 3. 



Norm Composite      Average Age      Grade

    71.0                115           III
    73.8                116
    76.6                117
    79.3                118
    82.1                119
    84.9                120
    87.6                121
    90.4                122
    93.2                123
    95.9                124
    98.7                125
   101.5                126
   104.2                127
   107.0                128           IV

The age interval from Grade III to Grade IV is 13 
months. The composite interval is 36 points. If 13 months 
equals 36 points, then one month is equivalent to 2.77 points. 
Hence the table begins with 71 and increases by 2.77 for 
each additional month until 107 is reached. Other grade 
intervals were spaced off the same way. 
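A working table of this kind is easily generated. The Python sketch below is a minimal illustration with hypothetical names; it spaces one grade interval into monthly steps and reads an educational age from the nearest entry. Because of differences in rounding, an entry may differ from the printed working table by a tenth of a point.

```python
def working_table(low_comp, high_comp, low_age, high_age):
    """One row per month: (norm composite to one decimal, age in months)."""
    months = high_age - low_age
    points_per_month = (high_comp - low_comp) / months
    return [(round(low_comp + points_per_month * k, 1), low_age + k)
            for k in range(months + 1)]


def read_age(score, rows):
    """Educational age of the row whose composite is nearest the score."""
    return min(rows, key=lambda row: abs(row[0] - score))[1]


# Grade III to Grade IV: composites 71 to 107, ages 115 to 128 months.
rows = working_table(71, 107, 115, 128)

for comp, age in rows:
    print(f"{comp:>6}  {age}")

print("Composite 91 reads as", read_age(91, rows), "months")
```

For pupil Ant.'s composite of 91 the nearest entry is 90.4, which gives the educational age of 122 months without any interpolation.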

Educational Age vs. Mental Age. — Educational age 
when determined by a proper team of educational tests is 
probably superior to mental age for realizing the first objec- 
tive of classification, namely, to bring together pupils of
equal educational status. Educational age is superior to
mental age for this purpose because it and it alone reveals 
directly what pupils are of equal status educationally.^ 
Educational age measures this directly. Mental age meas- 
ures educational status only indirectly. It has already been 
shown that there is a close relation between mental age and 
true educational status, but there are many forces operating 
to prevent this correlation from being perfect. A pupil's 
educational status is a resultant not only of his mental age 
but also of his health, attendance, attitude toward school 
work, industry, etc. Educational age takes into account 
both mental age and all these other factors which condition 
the quality of school work. Mental age, as usually tested, 
reveals the effect of these other factors but to a lesser extent.

Again, educational age is superior because it prevents the 
pupil from skipping valuable portions of the curriculum. If 
the curriculum has been properly constructed most of what 
is ahead is not likely to be so valuable as an equal amount of 
what is behind. 

Finally, educational age is superior because it prevents 
the skipping of pre-requisite portions of ability hierarchies. 
Work in the elementary school is of a rather hierarchical 
nature. Even geography and history have certain pre- 
requisites only a short distance below them. This point 
should not be stressed too much because gifted pupils have 
a phenomenal capacity to fill up really vital gaps. But 
educational age, particularly when it rests upon educational 
tests for the more continuous subjects, does guarantee that 
the pupil will not be handicapped by large gaps in his 
abilities. 

Franzen has demonstrated, in the case of pupils whose 
educational age is markedly below mental age, that by 
specially promoting them and by otherwise applying educa- 
tional pressure the educational age could be made to approxi- 
mate the mental age within one year. It would be interest- 
ing to learn whether this progress could not have been
secured just as well, if not better, by keeping them at all
times in the grade or grades closest to their educational age 
and applying the pressure there. 

Mental age is, however, superior to educational age for 
classifying pupils in the primary grades. In the present 
stage in the evolution of educational tests and prognostic 
tests for special abilities, mental age is probably the best 
basis of classification for high school and college freshmen 
also, though some schools follow the practice of determining 
classification on the basis of educational tests of the prog- 
ress made during the first week or weeks of school. 

E.Q. vs. I.Q. — The second fundamental objective of
classification is to bring together pupils who will progress at 
equal rate. Probably the best way to prophesy what the 
rate of progress will be is to find out what the rate of prog- 
ress has been. Educational age and mental age considered 
apart from chronological age tell us almost nothing about 
the past rate of progress. Rate of progress is shown by E.Q. 
and I.Q. If the pupil's intelligence has developed rapidly 
his I.Q. is proportionally high, above 100; if it has developed
slowly his I.Q. is low, below 100. Similarly, if
his educational ability has developed rapidly his E.Q. will
be proportionally above 100, and if it has developed slowly
his E.Q. will be proportionally below 100.

Like educational age and mental age, E.Q. and I.Q. com- 
pare more easily than they contrast. It is their close similarity
that strikes the student first. Fig. 1 brings out the
similarity of their distribution. The solid line shows 
Terman's ^ distribution of I.Q.'s for unselected pupils. The 
dotted line blocks out the frequency surface of the distribu- 
tion of E.Q.'s of about 500 pupils in a certain New York 
City school. (This school will be referred to hereafter as 
School Y.) The form of the E.Q. distribution closely 
approximates that of the I.Q. distribution. The I.Q.'s 
center at 100 while the E.Q.'s center five or so points below 
100. This tendency for the E.Q.'s of School Y to be below
the I.Q.'s for children in general is not surprising. The
school is below norm on the educational tests, which is
partly explained no doubt by the fact that on the average
the children have a heredity which all observers were agreed
is intellectually below par. Two diagrams comparing the
E.Q. with the I.Q. for these same children would be almost
identical.

^ L. M. Terman, The Measurement of Intelligence, p. 66; Houghton Mifflin Co.

[Fig. 1: frequency polygons of the McCall E.Q. and Terman I.Q. distributions; vertical scale marked from 5 to 35. The values printed beneath the figure read as follows.]

Quotient:  56-65, 66-75, 76-85, 86-95, 96-105, 106-115, 116-125, 126-135, 136-145
McCall:    (first entry illegible), 5.2%, 18.6%, 29.6%, 23.6%, 12.8%, 3.4%, 0.22%
Terman:    0.33%, 2.3%, 8.6%, 20.1%, 33.9%, 23.1%, 9.0%, 2.3%, 0.55%

Fig. 1. Comparison of E.Q.'s for 500 Pupils Tested by McCall with the I.Q.'s
of Unselected Pupils Tested by Terman.

A more detailed study of the data from School Y revealed
a constant tendency for low I.Q.'s to be below the corresponding
E.Q.'s and for high I.Q.'s to be above the corresponding
E.Q.'s. Several possible explanations of this
phenomenon, apart from the method and error of computa- 
tion, may be suggested. One is that stupid pupils, always 
in danger of failing, receive from the teacher the individual 
attention which not only belongs to them, but also that 
which belongs to the gifted pupils. This would raise the 
E.Q. of the stupid to the detriment of the E.Q. of the gifted. 
Again, educational tests may measure abilities to which
stupid pupils may be trained by dint of years of effort.
Intelligence tests are supposed to measure transferred
ability only. Again, bright pupils are not, as a rule, permitted
to go forward as fast as they could, consequently they
get no opportunity to learn the abilities measured by the
educational tests because these abilities are only taught in 
higher grades. Finally, it is possible that the variability 
of E.Q.'s is less than the variability of I.Q.'s, since the 
variability of I.Q.'s has been increasing since birth, while 
the variability of E.Q.'s has only been markedly increasing 
since school entrance. If this last is true all low I.Q. pupils 
and all high I.Q. pupils would automatically have higher 
and lower E.Q.'s respectively, while average I.Q. pupils 
would have equal E.Q.'s. 

Are Pupils Now Classified by Educational Age? — 
Teachers, educational tests, and mental age are in substan- 
tial agreement that pupils are not classified in homogeneous 
groups. The median educational age for each grade and 
section in School Y was as follows: 

IIIA  IIIB  IVA  IVB   VA   VB  VIA  VIB  VIIA  VIIB  VIIIA  VIIIB
 100   107  112  122  128  133  143  132   144   144    149    157

In School Y, VIB is actually behind VIA and there is prac- 
tically no progress at all between VIA and VIIB. 
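The irregular spacing of these medians is plain when the gain from each section to the next is set down. The short Python sketch below is an illustration only; the names are assumptions and the figures are those of the row above.

```python
SECTIONS = ["IIIA", "IIIB", "IVA", "IVB", "VA", "VB",
            "VIA", "VIB", "VIIA", "VIIB", "VIIIA", "VIIIB"]
MEDIAN_ED_AGE = [100, 107, 112, 122, 128, 133, 143, 132, 144, 144, 149, 157]

median_by_section = dict(zip(SECTIONS, MEDIAN_ED_AGE))

# Gain in median educational age (months) from each section to the next.
gains = {f"{a} to {b}": median_by_section[b] - median_by_section[a]
         for a, b in zip(SECTIONS, SECTIONS[1:])}

for step, gain in gains.items():
    print(f"{step:>16}: {gain:+d} months")

# Progress over the three half-grade steps from VIA to VIIB.
via_to_viib = median_by_section["VIIB"] - median_by_section["VIA"]
print("VIA to VIIB:", via_to_viib, "month")
```

The step from VIA to VIB is a loss of 11 months, and the whole span from VIA to VIIB shows a gain of only 1 month.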

But the position of the grade medians, improperly spaced 
as they are, does not begin to suggest how bad the classifi- 
cation really is. Fig. 2 permits a comparison of the amount 
of total grade overlapping. At a glance this diagram tells 
us that the extreme range of each grade is about 50 months 
in terms of educational age, which is equivalent to a range 
of about four typical grades, while the interval between the 
two grades is 2.5 months. The range of ability within one 
grade of School Y is then approximately 20 times the differ- 
ence between two adjoining grades. 

But before many assertions can be made about the amount 
of overlapping it is necessary to enquire which of the three 
possible causes of overlapping is responsible. These three 
causes are, (a) the trivial or inadequate nature of the
mental traits measured, (b) the unreliability of the test, and 
(c) incorrect classification. Is the large amount of over- 
lapping between grades of School X and School Y due to 
the last cause or to the first two causes? 






The tests used in both schools were not of a trivial nature. 
They measured mental traits which are now and ought to 
be central in classification. If two groups ideally classified 
and separated for educational purposes were measured for 
weight, a large overlapping between groups would be found. 
But this would not be an indictment of the classification for 
variations in weight have little significance for educational
classification. The traits measured in these schools were,
unlike weight, of vital significance for educational classification.

[Fig. 2: overlapping distributions of educational ages for Grades VI and VII, with the median of each grade marked.]

Fig. 2. A Graphic Picture of the Amount of Overlapping of the Educational Ages
of Pupils in Grades VI and VII of School Y.

While the traits measured were not trivial they were prob- 
ably inadequate. Kruse ^^ has shown that the obtained 
overlapping of abilities from grade to grade is always 
greater than the true overlapping and hence any classifica- 
tion based upon inadequate measures is sure to be too 
drastic. If classification ought to be based upon methods- 
of-work abilities, a charge of inadequacy cannot easily be
placed against the tests of School X. The tests used are
fairly representative samplings of the methods-of-work
abilities. If classification ought to be based upon informational
abilities such as are given by history and the like, the
charge of inadequacy has some validity. It is easy, however,
for those engaged in classification to meet this charge, for
objective informational tests exist and these tests may be
used if subsequent research reveals that there is not a high
correlation between method abilities and informational abilities.

^^ Paul Kruse, The Overlapping of Attainments in Certain Grades; Teachers
College, Columbia University, New York, 1918.

It is possible to go beyond Kruse and register a charge 
of inadequacy against all classification both old and new. 
Practically all classification, whether based upon objective
tests or teachers' judgments, has considered abilities and 
has almost completely ignored purposes. And yet so long 
as purposes together with abilities make up the curriculum 
it may be reasonable to ask that pupils be graded upon their 
possession of purposes as well as their possession of abilities. 
Half, if not more, of what goes to the making of an educated 
individual is now practically ignored in every system of 
classification. Since any attempt at present to measure pur- 
poses is likely to be highly inaccurate perhaps it is well 
that they have been ignored. It is important, however, that 
they not be forgotten, and that scientific students of educa- 
tion continue their efforts to make classification measure- 
ments more adequate than is now possible. 

Overlapping between grades is exaggerated not only 
through inadequacies but also through inaccuracies or unre- 
liability of the measurements. Imagine two regiments, each 
of which is composed of men of identical heights and each 
of which differs from the other by a very small amount. 
If measurers who were ignorant of the present accurate 
classification were given crude instruments for measuring 
height and told to determine the height of the men, the 
results would show an overlapping between regiments due 
to mistakes of measurements. This factor is always operating
in educational testing to make classification appear
worse than it really is. Kelley ^^ gives the following formula
for determining how much too large is the obtained
variability and hence how much too large is the obtained
overlapping.

$$\text{True S.D.} = \text{Obtained S.D.} \times \sqrt{\text{Self-correlation coefficient}}$$

Thus were the self-correlation coefficient (later chapter)
of all scores on all tests combined which were given to 
School X .25, i. e., were the self-correlation .25 between 
the composites or educational ages for all pupils in one 
grade and another similar determination of composites or 
educational ages, the true S.D. would be computed thus: 

$$\text{True S.D.} = \text{Obtained S.D.} \times \sqrt{.25} = .5 \times \text{Obtained S.D.}$$

That is, were the self-correlation no more than .25 the 
true overlapping is only .5 or one-half the obtained over- 
lapping — the interval between grades is twice the obtained 
interval. The self-correlation for the tests used at School X 
is probably sufficiently high to make the obtained overlap- 
ping so nearly the same as the true overlapping as to require 
no correction. The self-correlation for the reading test alone 
is over .8. When all the tests are combined the self- 
correlation is certainly over .9, though it has not been 
computed, and is probably about .97. Substituting, it is
found that the obtained S.D. is almost the same as the
true S.D.

$$\text{True S.D.} = \text{Obtained S.D.} \times \sqrt{.97} = .98 \times \text{Obtained S.D.}$$
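The correction is simple enough to tabulate for any self-correlation. The Python sketch below applies the formula quoted above; the function name is an assumption made for illustration.

```python
import math


def true_sd(obtained_sd, self_correlation):
    """True S.D. = obtained S.D. times the square root of the self-correlation."""
    if not 0 <= self_correlation <= 1:
        raise ValueError("self-correlation must lie between 0 and 1")
    return obtained_sd * math.sqrt(self_correlation)


# Shrinkage factor applied to the obtained S.D. for several self-correlations.
for r in (0.25, 0.80, 0.90, 0.97):
    print(f"self-correlation {r:.2f}: true S.D. = {true_sd(1.0, r):.2f} x obtained S.D.")
```

With a self-correlation of .25 the factor is .5, and with .97 it is about .98, as in the two substitutions above.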

Classification and Placement Table for a Small School. 
— The small illustrative School X has been reclassified. 
The final disposition of each pupil is shown in the last 
column of Table 2. The first step in regrading this school 
was to space off the grade intervals. Only one of the many 
ways for doing this need be described. The median educational
age for Grade III is 111 months. The median educational
age for Grade VIII is 178 months. This gives (178 - 111)
or 67 months to be divided into five equal portions
for the five grade intervals. This gives 67 ÷ 5 or 13.4. It
is reasonable to ask each of the grades from III to VIII to
do its part in lifting the educational age of its children from
111 to 178 months. Each grade's part is 13.4 months.

^^ Truman L. Kelley, "The Measurement of Overlapping"; The Journal of
Educational Psychology, November, 1919.

But some may reasonably contend that the school should, 
at least, lift its pupils from their present third-grade position 
of 111 months educational age to the eighth grade norm of
180 months. That is, they should lift them (180 - 111)
÷ 5 or 13.8 months per grade instead of 13.4 months. This
latter contention has not been accepted partly because to 
make the grade intervals 13.8 would necessitate a rather 
drastic reclassification of the pupils, and partly because it 
is known that this school could not well accomplish the task 
set for it without retarding its stupid pupils even more than 
at present. In sum, the past achievement of the school has 
been accepted as the basis of classification. Even their 
achievement of the past, like the achievement of the norm 
school, has been secured through holding back the stupid 
pupils while the brighter ones went ahead. 

Since the classification of the pupils in School X should 
not entirely neglect E.Q., the second step has been to deter- 
mine the median E.Q. of all the pupils. The median of the 
six median E.Q.'s for the six grades was sufficiently accurate 
for the purpose. 

Grade          III    IV     V     VI    VII   VIII   Median

Median E.Q.     90    97    105   102    97    103    100-
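As a quick check on the last column, the median of the six grade medians can be computed directly. The sketch below uses only the figures in the row above; the variable names are illustrative.

```python
from statistics import median

# Median E.Q. of each grade in School X, as given in the row above.
grade_median_eq = {"III": 90, "IV": 97, "V": 105, "VI": 102, "VII": 97, "VIII": 103}

# With six values the median is the mean of the two middle ones (97 and 102).
typical_eq = median(grade_median_eq.values())
print(typical_eq)
```

The result falls just under 100, which appears to be what the entry "100-" is meant to convey.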

Knowing that the third-grade median educational age is 
111, that the grade intervals are 13.4 and that the typical
E.Q. of the school is 100, it was possible to construct a 
reclassification table for present pupils and a placement table 
for future pupils. In Table 4 each new grade median has 
been made larger than the preceding one by 13.4. To 
facilitate classification the quarter point below and above 
each grade median has been given. Since the tests were 
given at the extreme end of the school year, all pupils listed 
in Grade III, IV, etc., of Table 2 were just on the point 
of becoming Grade IV, V, etc., pupils respectively. And 
since it was desired to state, not in what grade each pupil 






should be at the time the tests were given, but in what grade 
the pupils should be placed the following September, Table 
4 was constructed as though the median educational age of 
111 months belonged to the fourth instead of the third grade.
When tests are given in the middle of a grade it is preferable 
to show, instead, the grade where the pupils should be at the 
time the tests are given. 

TABLE 4

Reclassification Table for Present Pupils and Placement Table for
Future Pupils of School X

                   Ed. Age         E.Q.    Grade

Median              97.6 (est)     100     III
Third Quarter      101.1
First Quarter      107.5
Median             111.0           100     IV
Third Quarter      114.5
First Quarter      120.9
Median             124.4           100     V
Third Quarter      127.9
First Quarter      134.3
Median             137.8           100     VI
Third Quarter      141.3
First Quarter      147.7
Median             151.2           100     VII
Third Quarter      154.7
First Quarter      161.1
Median             164.6           100     VIII
Third Quarter      168.1
First Quarter      174.5
Median             178.0           100     IX
Third Quarter      181.5
First Quarter      187.9
Median             191.4 (est)     100     X
Third Quarter      194.9
First Quarter      201.3
Median             204.8 (est)     100     XI
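The construction of Table 4 can be sketched in a few lines. The anchor (111 months assigned to Grade IV) and the interval of 13.4 come from the text; the quarter-point offset of 3.5 months is an assumption read off the tabulated values, since the text does not state how the quarter points were obtained. Function and variable names are illustrative.

```python
GRADES = ["III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI"]


def grade_interval(low_median, high_median, n_intervals):
    """Spread the distance between two grade medians over equal intervals."""
    return (high_median - low_median) / n_intervals


def build_table(anchor_grade, anchor_median, interval, quarter_offset):
    """Return {grade: (first quarter, median, third quarter)} in months."""
    anchor = GRADES.index(anchor_grade)
    table = {}
    for i, grade in enumerate(GRADES):
        med = anchor_median + interval * (i - anchor)
        table[grade] = (
            round(med - quarter_offset, 1),
            round(med, 1),
            round(med + quarter_offset, 1),
        )
    return table


interval = round(grade_interval(111, 178, 5), 1)      # the 13.4 adopted in the text
alternative = round(grade_interval(111, 180, 5), 1)   # the 13.8 that was rejected
# 3.5 is an ASSUMED offset inferred from the table, not stated in the text.
table4 = build_table("IV", 111.0, interval, 3.5)
for grade, row in table4.items():
    print(grade, row)
```

Here each grade is listed with the first quarter that precedes its median and the third quarter that follows it, which is the way the rules below use the table.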




Rules for Reclassifying Small School. — In reclassify- 
ing the pupils of School X, the following rules were adopted. 
These rules are not entirely arbitrary. They are based upon 
some experience with this method of reclassification. 

1. No pupil will be demoted or denied his normal pro- 
motion unless his educational age falls below the third 
quarter of the grade in which it is proposed to place him. 
This rule gives the slow pupil the benefit of any doubt con- 
cerning the reliability of the tests, and is a slight concession 
to the idea that chronological age has some significance for 
classification. Further, it may not be wise to run the risk 
of discouraging by demotion or failure of promotion a stupid 
pupil who may be doing his best unless it is pretty certain 
that he really needs the work below. 

2. No pupil will be skipped over a grade whose educa- 
tional age does not exceed the first quarter of the grade to 
which it is proposed to skip him. If the classifier cannot 
afford to risk a few failures as a result of his recommenda- 
tions he had better require that the pupil exceed the median 
of the grade to which he is promoted. 

3. No pupil whose E.Q. is below the typical E.Q. of the
school will be permitted to skip one or more grades unless 
his educational age exceeds the median of the grade to 
which he is skipped. If his educational age exceeds the first 
quarter of the grade to which he is skipped and his E.Q. is 
above the typical E.Q. of the school his relative position will 
very probably be higher still by the end of the year. If, 
however, his E.Q. is low he may be at the bottom of the 
class by the end of the year and thus be in danger of failing. 
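The three rules can be read as a small decision procedure. The sketch below is one possible reading, not the author's own statement of a method: the rules as written say when a change is forbidden, and the code simply makes every change they permit. The medians are those of Table 4, the 3.5-month quarter offset is inferred from that table, and treating an E.Q. exactly equal to the typical E.Q. as "not below" is an assumption. Grades are written as integers (3 for Grade III, and so on).

```python
# Table 4 medians in months, keyed by the grade to be entered in September.
MEDIANS = {3: 97.6, 4: 111.0, 5: 124.4, 6: 137.8, 7: 151.2,
           8: 164.6, 9: 178.0, 10: 191.4, 11: 204.8}
QUARTER = 3.5      # ASSUMED distance from a median to its quarter points
TYPICAL_EQ = 100   # typical E.Q. of School X


def first_quarter(grade):
    return MEDIANS[grade] - QUARTER


def third_quarter(grade):
    return MEDIANS[grade] + QUARTER


def place(current_grade, ed_age, eq):
    """Grade for next September under one reading of Rules 1 to 3."""
    normal = current_grade + 1          # the regular promotion
    placed = normal
    # Rules 2 and 3: skip only while the bar of the next grade is exceeded.
    while placed + 1 in MEDIANS:
        target = placed + 1
        bar = first_quarter(target) if eq >= TYPICAL_EQ else MEDIANS[target]
        if ed_age > bar:
            placed = target
        else:
            break
    if placed > normal:
        return placed
    # Rule 1: hold back only if below the third quarter of the lower grade.
    while placed - 1 in MEDIANS and ed_age < third_quarter(placed - 1):
        placed -= 1
    return placed


print(place(3, 122, 112))  # pupil Ant: educational age 122, E.Q. 112
print(place(3, 122, 95))   # hypothetical pupil with the same age but a low E.Q.
print(place(3, 95, 71))    # hypothetical educational age paired with an E.Q. of 71
```

The first call reproduces the decision reported for pupil Ant, who skips Grade IV; the other two pupils are made up to exercise Rules 3 and 1.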

There are other considerations which ideally should influ- 
ence classification. Besides an attempt to diagnose differ- 
ences between E.Q. and I.Q. in order to discover which is 
the better index of rate of progress, it might be profitable 
to inspect each pupil's record on each test separately. A 
pupil might possibly have a high composite on all tests 
together with a sufficient disability in one to prove fatal. 
Arithmetic and reading should be carefully inspected. For 




example, there is scarcely a pupil in Grade III of School X 
(Table 2) who does not have a zero score in subtraction. 
Why this is so is something of a mystery. This single disa- 
bility might prove fatal to the arithmetic work of those 
pupils who have been skipped to Grade V. The double pro- 
motion was recommended only on condition that pupils 
having marked special disabilities be given special help. 

The Reclassification. — The pupils of Table 2 have 
been reclassified according to the three simple, objective, 
and easily applied rules stated above, and in accordance 
with Table 4. The reclassification recommended is shown 
in the last column of Table 2. The first pupil in Grade III, 
namely pupil Ant, has been permitted to skip Grade IV and 
enter Grade V. His educational age of 122 months is above 
the first quarter of Grade V and his E.Q. of 112 is above the 
typical E.Q. of the school, namely 100. The next pupil, 
Bau, has been given his regular promotion to Grade IV. 
He exceeds its first quarter but his E.Q. is below 100. This 
would disqualify him from being skipped from Grade II 
to Grade IV, but does not rob him of his regular promotion. 
Pupil Soh has been kept in Grade III. His educational age 
is below the third quarter of Grade III and his E.Q. is as 
low as 71. Educationally he would probably be lost in 
Grade IV, and what is more important he would rob the 
regular pupils of Grade IV by demanding an undue amount 
of the teacher's time, etc. Pupil Mye of Grade IV has been 
skipped over Grades V and VI. His educational age is even 
beyond the third quarter and his E.Q. is 114. The child 
has been given, not a favor, but simple justice. Pupil Sim 
in Grade V is a doubtful case. He did not get the benefit 
of the doubt. He is still young and the school as a whole 
is slightly below standard. Pupil Era in Grade VI shares 
with Pupil Kim of Grade III the honor of being the most 
intelligent child in the school. He is probably a genius.
His educational age and E.Q. qualify him for Grade X.
Although the table shows him promoted to Grade X it was 
not actually recommended that any child be promoted beyond
the first year of high school, partly because high school
pupils are of a better intellectual caliber than elementary 
school pupils, partly because high school work is not a con- 
tinuation of elementary school subjects, and finally because 
the recommendation would in all probability have been 
utterly disregarded by the high school to which the pupil
would go. Pupil Kim in Grade VI is also advanced to high 
school. He is a near genius and is a brother of pupil Kim 
in Grade III. Pupil Ant in Grade VI has been skipped over 
Grade VII just as his brother pupil Ant in Grade III was 
skipped over Grade IV. Among others, pupils Bel, Lut, and 
Hen in Grade VII have been sent to high school. They have 
completed the elementary school work better than the typi- 
cal pupil. The action taken in these cases may well be 
questioned, because their E.Q.'s are below 100. They will
meet a much fiercer competition in high school. Another 
year in the elementary school will probably not, however, 
improve their high-school chances. Since they are over 
fifteen years old there is little hope for much further growth 
in mental age. Pupil Pug in Grade VII has close to the 
highest educational age in the school. He has been skipped 
to high school just as his sister, pupil Pug, in Grade VI was 
skipped over Grade VII. Just as two watches, built at differ- 
ent times by the same master-hand, run together, so these 
brothers and sisters, springing from a common heredity, tend 
to move at closely similar rates through the school. 

In case it is desired to reclassify the school on the basis 
of what typical schools are achieving rather than upon the 
basis of the past achievement of the particular school in 
question the classification table should be derived, not from 
the median achievement of the grades in the particular 
school, but from Table 3. All pupils whose educational ages 
fall below (89 + 102)/2 are Grade I pupils and should be placed
in Grade II, those between (89 + 102)/2 and (102 + 115)/2 are
Grade II pupils and should receive their promotion to Grade
III, those between (102 + 115)/2 and (115 + 128)/2 should be
placed in Grade IV, those between (115 + 128)/2 and
(128 + 141)/2 should be placed in Grade V, and so on. Decision
in doubtful cases should depend upon the size of the
pupil's E.Q. In case half-yearly promotions are planned,
pupils should be placed in the lower half of Grade V whose
educational ages fall between (115 + 128)/2 and 128, and in
the upper half of Grade V whose educational ages fall between
128 and (128 + 141)/2, and similarly for other grades.
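The midpoint rule lends itself to a short sketch. Only the five norms quoted in the passage are used, so the lookup stops at Grade VI; which side a pupil falls on when his age sits exactly on a boundary is not stated in the text, and the choice made here (the upper band) is an assumption. The names are illustrative.

```python
from bisect import bisect_right

# End-of-year norms in months for Grades I to V, as quoted from Table 3.
NORMS = [89, 102, 115, 128, 141]

# Boundaries halfway between successive norms: 95.5, 108.5, 121.5, 134.5.
CUTS = [(a + b) / 2 for a, b in zip(NORMS, NORMS[1:])]


def september_grade(ed_age):
    """Grade in which to place a pupil next September (2 means Grade II)."""
    finished = bisect_right(CUTS, ed_age) + 1   # grade the pupil has in effect completed
    return finished + 1


def half_of_grade(ed_age):
    """Grade and half-year section when promotions are half-yearly."""
    grade = september_grade(ed_age)
    norm = NORMS[grade - 2]                     # the norm in the middle of the band
    return grade, ("lower" if ed_age < norm else "upper")


print(september_grade(100))   # between 95.5 and 108.5
print(half_of_grade(125))     # between 121.5 and 128
print(half_of_grade(130))     # between 128 and 134.5
```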



The Amount of Promotion and Demotion Necessary. 
— The amount of reclassification necessary in this sample 
school, even when the classification has been somewhat con- 
servative, is shown in Table 5. 

TABLE 5

Distribution of Changes Made in Reclassifying School X by Means
of Educational Tests. (Data from last column of Table 2)

                                    Number of Pupils
Amount of Change          III    IV     V    VI   VII  VIII   Total

Demoted three grades
Demoted two grades                            2     2            4
Demoted one grade           2     1     1     1     3     3     11
No change                  13    13    11     8     7     5     57
Promoted one grade          4     1     2     5     5     1     18
Promoted two grades               2           1     1     2      6
Promoted three grades                         1     1            2



Table 6 gives similar data for a larger school — School Y 
— where the technique of reclassification was practically the 
same. The tests used in this school were Thorndike's Reading
Scale Alpha 2, Thorndike's Visual Vocabulary Scale
A2 X, Trabue's Completion Scales B and C, Ayres' Spelling
Scale, Starch's Reasoning Test in Arithmetic, Woody-McCall's
Mixed Fundamentals, Parts I and II, and Composition.

TABLE 6

Distribution of Changes Made in Reclassifying School Y by Means
of Educational Tests

                                    Number of Pupils
Amount of Change          III    IV     V    VI   VII  VIII   Total

Demoted four grades
Demoted three grades
Demoted two grades                                        1      1
Demoted one grade                 4     1     2     5    13     25
No change                  39    39    38    29    45    36    226
Promoted one grade         15    16    39    27    12     9    118
Promoted two grades               5     9    11     3     1     29
Promoted three grades                         1     1     3      5
Promoted four grades                          2           1      3



The two tables when combined lead to the following con- 
clusions which need to be only slightly discounted because 
of unreliability of the tests: 

1. About 44 per cent of pupils are wrongly classified. 

2. About 34 per cent of pupils are misplaced one grade. 

3. About 10 per cent of pupils are misplaced two or 
more grades. 

4. Only about 8 per cent of pupils are pushed ahead of 
the grade where they belong, while nearly 36 per cent are 
held back from the grade where they belong. 
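The four percentages can be recomputed from the Total columns of Tables 5 and 6. The sketch below keys each count by the number of grades changed, with negative numbers for demotions; the dictionary layout and variable names are illustrative choices.

```python
from collections import Counter

# Total columns of Tables 5 and 6; key = grades changed (negative = demoted).
school_x = Counter({-2: 4, -1: 11, 0: 57, 1: 18, 2: 6, 3: 2})
school_y = Counter({-2: 1, -1: 25, 0: 226, 1: 118, 2: 29, 3: 5, 4: 3})

combined = school_x + school_y
total = sum(combined.values())


def percent(count):
    return round(100 * count / total)


wrongly_classified = percent(total - combined[0])
one_grade = percent(combined[-1] + combined[1])
two_or_more = percent(sum(n for change, n in combined.items() if abs(change) >= 2))
pushed_ahead = percent(sum(n for change, n in combined.items() if change < 0))
held_back = percent(sum(n for change, n in combined.items() if change > 0))

print(total, wrongly_classified, one_grade, two_or_more, pushed_ahead, held_back)
```

A demotion corresponds to a pupil who had been pushed ahead of the grade where he belongs, and a special promotion to one who had been held back.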

Success of the Reclassification in School X. — The re- 
classification of the pupils in School X as shown in Table 2, 
was recommended on the following conditions: (1) All
promotions and demotions were to be trial promotions and 
demotions, and the pupils were to be so informed. (2) 
After four weeks of trial, the principal in consultation with 
the teachers was to construct a series of examinations upon 
the material studied during the four weeks and try these 






tests upon the pupils. (3) The teachers were to rank the 
pupils in their respective classes upon the quality of their 
work during the four weeks. (4) The principal and teachers 
were to decide the final disposition of the pupils. 

The demoted pupils have "made good," i. e., there has 
been no disposition to question the recommendations in their 
cases. Not one has been returned to his original grade. 
What happened to the promoted pupils for whom reports 
are available is shown in Table 7. This table is read as 
follows: Pupil War J., who has an E.Q. of 90, was promoted 
over Grade IV. He ranked, according to the educational 
tests, first among the sixteen pupils who, together with him, 
made up Grade V. He ranked first among the same sixteen 



TABLE 7 

What Happened to the Specially Promoted Pupils of School X 



Pupil 


E.Q. 


Grade 
Skipped 


Rank 

by Ed. 

Age 


Rank 
by Prin- 
cipal 


Rank 

by 

Teacher 


Final 
Dispo- 
sition 


War J. 
War R. 
Kim 
Ant 


90 
108 

134 
112 


IV 
IV 
IV 
IV 


1-16 

4-16 

10-16 

12-16 


1-16 
2-16 
7-16 
8-16 


6-16 

3-16 

9-16 

11-16 


V 

V 

IV 

IV 


Mye 
Sco 
Van 
Hoy 


114 

125 
no 
112 


V & VI 

VI 
V& VI 

VI 


1-16 

3-16 

7-16 

10-16 


3-16 
11-16 

7-16 


5-16 
14-16 
13-16 

9-16 


VII 
VI 
VI 

VII 


Fra* 
Kim * 
Spi 
Lan 

Pug 
Ant 
Mit 


135 
131 
118 
108 

no 
121 
113 


VII 
VII 
VII 
VII 

VII 
VII 
VII 


1-16 
4-16 
7-16 
9-16 

10-16 
11-16 
12-16 


4-16 
10-16 
14-16 
16-16 

12-16 

' 8-16 


6-16 

9-16 

13-16 

16-16 

14-16 
11-16 
15-16 


VIII 

VIII 

VII 

VII 

VII 
VII 
VII 


Pug 


126 


VIII 


3-10 


I-IO 


5-10 


IX 


Average 


6.2-15.6 


7.4-15.6 


9.6-15.6 





* Certain difficulties kept these pupils from being sent to high school as 
recommended. 



54 How to Measure in Education 

pupils who were tested by the principal upon four weeks of 
school work. In the judgment of his teacher he ranked 
sixth. He was finally retained in Grade V. 

Table 7 permits an interesting psychological study of the 
pedagogical mind. The data are too inadequate to allow 
other than tentative generalizations, in order to provoke 
further study. Anyway, the purpose here is not so much to 
draw conclusions as to describe a process. The table sug- 
gests the following: 

1. A specially promoted pupil tends to be ranked lower 
by the teacher's judgment than by the principal's examina- 
tion or by standard educational tests. The averages at the 
bottom of the table show that the average ranks by tests, 
principal, and teacher are respectively 6.2, 7.4, and 9.6 out 
of about sixteen pupils. 

2. A young, specially-promoted pupil must succeed be- 
yond a shadow of doubt or he will be demoted. Pupils Kim 
and Ant of Grade V, and possibly Mit of Grade VIII, did 
better than was originally anticipated and yet they were 
reduced a grade. 

3. A pupil's educational age and E.Q. must at least ex- 
ceed the median of the grade to which he is sent or the 
teacher and principal will probably return him. And it 
should be remembered that the principal and teachers of 
School X were friendly to the experiment. 

4. The school's staff is convicted of injustice by its own 
measurements. Can anyone unacquainted with school tra- 
ditions give a rational explanation of why pupils Kim and 
Ant were sent back to Grade IV? The real fact is that 
these teachers require a young pupil to do, not the typical 
work of the grade, but the best work in the grade. The 
teachers of School X testify that most of the young pupils 
demoted had rapidly risen in rank since the opening of 
school. With their high E.Q.'s it is probable that this 
process would continue throughout the year, thus making 
their class status better and better. 

The teachers explained that pupils Kim and Ant were
demoted, even when their rank was satisfactory, because
those ranking below them were relatively stupid pupils.
This factor would not have influenced the teachers had any- 
one been present to explain that while these pupils were 
stupid, they were also much over age. Additional years of 
schooling had balanced their stupidity. While their E.Q.'s 
were low their educational ages were as high as pupils Kim 
and Ant. If the measurements of the principal and the 
judgments of the teachers be accepted at their face value, 
only one pupil was legitimately sent back and that is pupil 
Lan. All the others have paid the penalty of their promi- 
nence and particularly of their unfortunate youthfulness, 
unless it be assumed that the fundamental basis for the 
classification of pupils should be chronological age. 

What happened to the pupils who were sent back? If 
the effect of demotion is to produce sulkers, special promo- 
tion should not be given unless there is considerable cer- 
tainty that the promotion will be maintained. The principal 
reports that one or two were glad to go back to their former
companions, some did not want to go back, and some didn't care.
Every pupil except Lan is at the top or near the top of
the grade to which he was returned. The principal
reports that those originally demoted on the basis of the 
tests are happy and satisfied. 

Franzen has since tried the experiment in the Garden City 
elementary school of giving special promotion only to those 
pupils whose educational age and E.Q. or mental age and 
I.Q. both exceed the median of the grade to which they are 
sent. In no case was a pupil afterward demoted. 

Procedure for Reclassifying Large School. — The procedure
recommended for reclassifying a small school has
one fundamental defect. It secures homogeneity of educa- 
tional status, but it does not guarantee equal rate of prog- 
ress. E.Q., the prophet of future progress, was used only 
incidentally. 

A thoroughgoing use of both educational age (or mental 
age) and E.Q. (or I.Q.) requires a school with enough pupils 




and teachers to make two or, preferably, three or more 
classes per grade. Given this condition it is possible to have 
parallel groups, some of which either progress more rapidly 
through the grades or else take a wider educational swath. 
This enables the school to keep together pupils with like 
educational ages, E.Q.'s, chronological ages, and social and 
vocational needs. It insures that no pupil will be forced to 
skip vital parts of the school's work in order to progress 
rapidly, that no pupil will lack the training which comes 
from keen competition, that no pupil's vanity will be fos- 
tered by a perception of unquestioned superiority on his 
part, that no pupil will form habits of mental laziness by 
living in a mentally non-stimulating atmosphere and by 
being continually called upon to master difficulties which 
he has already mastered, and, finally, that no pupil will 
be persecuted or taught social timidity by the brute physical 
strength of the chronologically older. Many of these advan- 
tages can be secured through a classification on the basis 
of educational age alone. All can be secured through a 
classification by both educational age and E.Q. 

The steps in the procedure of classifying both by educa- 
tional age and E.Q. follow: 

1. Pupils should be classified into the various grades on 
the basis of educational age without regard to E.Q. It is 
suggested that no pupil be specially promoted or demoted 
unless he exceeds or falls below respectively the first quar- 
ter or third quarter of the grade in question, as was done in 
the case of School X. 

2. When the horizontal grade classification on the basis 
of educational age has been completed, the pupils in each 
grade should be divided into groups on the basis of E.Q. 
If, for example, there have been enough pupils placed in 
the third grade by the first classification to make three 
classes, all pupils whose E.Q. is, say, over 110 may be
placed in one group, all whose E.Q. is between 90 and 110
in another group, and all whose E.Q. is below 90 in the third 
group. This process should be repeated for each grade. 




The E.Q. points selected will depend upon the distribution 
of the E.Q.'s in each grade, and the number of pupils de- 
sired in each group. The usual practice is to place fewer 
pupils in the superior and inferior groups than in the normal 
groups. Or instead, the pupils in each grade may be 
arranged in order according to the size of their E.Q.'s. The 
highest third can be placed in one group, the middle third 
in another, and the bottom third in another. 
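Both ways of forming sections within a grade can be sketched briefly. The cut-off points of 90 and 110 are the ones suggested above; the six pupils and their E.Q.'s are made up for illustration, as are the function names. A pupil whose E.Q. is exactly 90 or 110 is placed in the middle group here, which the text leaves open.

```python
def sections_by_cutoffs(eq_by_pupil, low=90, high=110):
    """Divide one grade into three sections at fixed E.Q. points."""
    sections = {"high": [], "middle": [], "low": []}
    for pupil, eq in eq_by_pupil.items():
        if eq > high:
            sections["high"].append(pupil)
        elif eq >= low:
            sections["middle"].append(pupil)
        else:
            sections["low"].append(pupil)
    return sections


def sections_by_thirds(eq_by_pupil):
    """Rank pupils by E.Q. and cut the list into three nearly equal parts."""
    ranked = sorted(eq_by_pupil, key=eq_by_pupil.get, reverse=True)
    size = -(-len(ranked) // 3)   # ceiling of one third of the class
    return ranked[:size], ranked[size:2 * size], ranked[2 * size:]


# HYPOTHETICAL pupils placed in one grade by educational age.
grade_three = {"A": 125, "B": 112, "C": 104, "D": 98, "E": 91, "F": 84}

print(sections_by_cutoffs(grade_three))
print(sections_by_thirds(grade_three))
```

Fixed cut-offs usually leave the outer sections smaller than the middle one, which matches the usual practice mentioned above; cutting into thirds gives sections of equal size instead.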

The procedure described for both a small school and a 
large school is identical whether educational or intelligence 
tests are being used. If intelligence tests are being used, 
mental age takes the place of educational age, and I.Q. the 
place of E.Q. If pupils are being classified into grades only 
and not into sections within each grade, neither educational 
age nor E.Q. is absolutely required. Pupils could be
grouped with reasonable accuracy on the basis of their composite
scores. In this case composite score takes the place
of educational age in the classification table.

The Placement of Future Pupils. — Copies of the tests
used in classifying the pupils of School X were left with 
the principal, together with standard instructions for apply- 
ing and scoring them. When a new pupil arrived at the 
school, the principal applied these tests, scored them, multi- 
plied the score on each test by the same multipliers shown 
at the bottom of Table 2, added the weighted scores to get 
the pupil's composite score, converted his composite score 
into educational age according to Table 3, divided the educa- 
tional age by the pupil's chronological age to get E.Q., 
placed the pupil in the appropriate grade according to his 
educational age, his E.Q., and Table 4, after having made 
due allowance for the progress of the regular pupils since 
they were reclassified. 
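The chain of computations the principal carried out for a new pupil might be sketched as follows. Everything numerical here is a placeholder: the multipliers stand in for those at the bottom of Table 2, the conversion function stands in for the lookup in Table 3, and the raw scores and chronological age are invented. Multiplying the quotient by 100 is an assumption made so that the result is on the same scale as the E.Q.'s quoted in this chapter.

```python
# HYPOTHETICAL multipliers; the real ones are at the bottom of Table 2.
MULTIPLIERS = {"reading": 2.0, "arithmetic": 1.0, "spelling": 0.5}


def composite_score(raw_scores):
    """Weight each test score by its multiplier and add the results."""
    return sum(MULTIPLIERS[test] * score for test, score in raw_scores.items())


def to_educational_age(composite):
    """HYPOTHETICAL stand-in for the Table 3 conversion (months)."""
    return 60 + composite


def educational_quotient(ed_age_months, chron_age_months):
    """Educational age divided by chronological age, on a scale where 100 is par."""
    return 100 * ed_age_months / chron_age_months


# Invented raw scores and chronological age for a newly arrived pupil.
raw = {"reading": 20, "arithmetic": 15, "spelling": 14}
composite = composite_score(raw)
ed_age = to_educational_age(composite)
eq = educational_quotient(ed_age, 109)
print(composite, ed_age, round(eq))
```

The educational age and E.Q. found in this way are then carried to Table 4 and the three rules, with the allowance the text mentions for the progress the regular pupils have made since they were reclassified.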

III. Classification by Teacher's Judgment 

Inadequacy of Teacher's Judgment. — The teacher's
judgment is an inadequate basis for classification (a) when 




pupils first come to school, whether from the home or from 
some other school, and (b) even when pupils first enter the 
class from some other class in the same school. It is diffi- 
cult for a teacher to give a satisfactory judgment even after 
considerable experience. Teachers lie awake nights worry- 
ing whether or not to promote a given pupil partly because 
they are somewhat uncertain of their judgment. No judg- 
ment at all is available for pupils with whom the teacher is 
unacquainted. 

Again, the teacher's judgment is inadequate even when 
she knows her own pupils well. Teachers are not blind. 
They can distinguish between their brightest and dullest 
pupils with considerable accuracy. But lines of classifica- 
tion are rarely drawn through a single class. These lines 
usually break across several classes. A teacher may know 
her own pupils well and yet be unable to tell whether her 
ablest pupils are equal to, superior to, or inferior to the 
average or brightest pupil in some other class. Some form 
of measurement is needed which does not require a previous 
acquaintance with pupils and which compares pupils in 
different classes with as great ease as it compares pupils in 
the same class. 

Inaccuracy of Teacher's Judgment. — The very nature 
of subjective measurement and of the pupils being measured 
make a teacher peculiarly liable to err in estimating the 
ability of pupils. Teachers' judgments are subject to cer- 
tain constant errors. One such error is the tendency to 
confuse conduct with achievement. An estimate of a pupil's 
ability in reading is usually all tangled up with a cheery 
"good morning," a courteously opened door, close attention
to the teacher's remarks, and other examples of impeccable 
behavior. It is probable that conduct (purposes) should be 
considered in classification, but conduct should not be 
allowed to cloud a teacher's estimate of abilities.

Again, Terman finds that teachers frequently err in esti- 
mating intelligence because they fail to take age and emo- 
tional differences into account. An over-age or emotionally 




vivacious pupil is usually rated too high and an accelerated 
or phlegmatic child is usually rated too low. Finally, 
Whipple finds after an unusually exhaustive study of gifted 
children, that while teachers are not likely to rate dull 
pupils as average or superior, they are likely to rate superior 
children as average. He therefore concludes that mental 
examiners are as much needed for selecting superior chil- 
dren as for selecting inferior children, because these mental 
examiners employ measuring instruments which are not sub- 
ject to these constant errors. 

Importance of Teacher's Judgment. — Teachers' marks 
are important because they are now and will continue for 
some time to be the most universal method of rating pupils. 
In fact, they may continue forever to be the criterion for 
classification, because teachers will soon be familiar with 
the simple mysteries of scientific measurement. They will 
themselves use tests with the same ease and fluency that 
they now use text-books. More and more they will base 
their judgments upon objective rather than subjective meas- 
urement. When this time arrives teachers' marks will be 
not only as accurate as objective measurement, but they will 
be objective measurement plus something else. 

This something else makes teachers' marks valuable even 
apart from objective tests. There are significant segments 
of each pupil's make-up which tests do not now touch. In 
occasional instances this segment has more importance 
for classification than the segment measured by tests. 
Teachers' judgments appear to be the only immediate hope 
of measuring pupils' purposes. Furthermore, the teacher 
can weight physical, emotional, and social characteristics in 
ways that tests cannot. In so far as these elements in the 
make-up of a pupil are aims of instruction, and in so far 
as they are improvable by school instruction, they ought 
to be weighted in determining promotions. Hence the 
teacher's judgment should be a factor in determining promo- 
tion, at least until the time arrives where these relatively 
intangible abilities are shown not to exist, or to be included 




in the score on objective tests, or shown to be abilities which 
are unimprovable by school instruction. 

Computation of Pedagogical Age. — The intelligence 
test determines mental age, the educational tests determine 
educational age, the status of a pupil as judged by a teacher 
may, for convenience, be called pedagogical age. But how 
can a teacher whose rating of pupils is confined to her own 
class and is relative to her own class determine a pedagogical 
age which is comparable to mental and educational age and 
may be combined with them to decide a pupil's classification? 

This is a problem which has been worrying psychologists 
and statisticians for many years. Many schemes have been 
proposed to increase the fairness and usefulness of teachers' 
marks. To be thoroughly satisfactory whatever plan is 
proposed must provide that a teacher's marks for pupils in 
one grade will be comparable with another teacher's marks 
for pupils in another grade or another class in the same 
grade. Allowance must be made for the fact that some 
classes are large and some are small, that some have un- 
usually stupid pupils while others have unusually bright 
pupils, and that some classes have a large variability while 
others have a small variability. Finally the plan must yield 
scores comparable with scores on tests, and which enable 
the school to establish permanent limits of ability for each 
grade and class for purposes of classification and place- 
ment. 

I have devised a simple scheme of marking which will 
meet fairly well all the above conditions. This plan follows: 

1. Reliably measure with some objective test or tests 
all the pupils of the school. This may be an intelligence 
test, a battery of educational tests, or a single educational 
test. If only one educational test is used perhaps the best 
would be one which measured the pupils' comprehension in 
silent reading. 

2. Compute mental age or educational age (or reading 
age) for each pupil. 




3. Arrange the pupils in each class separately in order 
of their mental age or educational age on the objective tests. 

4. Have each teacher rank the pupils in her class in 
order of how they should, in her judgment, be promoted. 
If several teachers rank the same pupils their ranks may 
be averaged. By multiplying one teacher's ranks by two 
or three that series of ranks may, if desired, be given double 
or triple weight. 

5. Give the pupil ranked highest by the teacher the 
highest mental or educational age made by any pupil in her 
class on the objective test. Give the pupil ranked second 
by the teacher the second highest mental or educational 
age made on the objective test, and so on down the class.
These are the pupils' pedagogical ages. Thus the test bridges, for the teacher, the
gap between grades and classes. The test shows the teacher 
where to begin rating, and how much variability to provide 
for. 

While tests will make occasional mistakes in the case of
individuals, their determination of differences between 
groups and the variability of a group is approximately cor- 
rect, particularly if enough tests are given to make results 
adequate and reliable. It is probable that the true varia- 
bility of a group and the true differences between groups 
in mental age or educational age are about the same as the 
true variability and true differences, respectively, in the 
traits upon which the teacher bases her judgments. There 
may be a small error at the extremes of each class. Thus 
tests help the teacher where she most needs help, namely, 
in relating her pupils to pupils in other classes and grades. 
Tests tell her what scores to assign to the pupils in her class, 
but leave her free to decide who shall receive these scores. 
Pupils whose educational age and pedagogical age differ 
widely should receive special attention from the school psy- 
chologist. 

6. Average the mental or educational age and the peda- 
gogical age to get each pupil's promotion age. Or the 




pedagogical age alone may be used for a promotion age. 
If each pupil has a mental, educational, and pedagogical 
age, then all three may be averaged, giving any desired 
weight to each. The weight given will depend upon the 
number and nature of the tests used. 

7. Compute the Promotion Quotient for each pupil. The
top pupils in a sample class are given below:

Pupil   Ch.    Ed.    Rank by   Ped.   Promotion   Promotion
        Age    Age    Teacher   Age    Age         Quotient

r       140    144    2         138    141         100.7
a       150    138    3         133    135.5        90.3
h       110    133    1         144    138.5       125.9
b       121    130    4         130    130         107.4
etc.    etc.   etc.   etc.      etc.   etc.        etc.



8. Classify pupils into grades on the basis of promotion 
age and into sections on the basis of Promotion Quotient. 

9. Since the Promotion Quotient shows the pupil's typi- 
cal rate of progress a little simple arithmetic will show what 
the promotion age of each pupil may be expected to be 
after a year of instruction. This expected age may, if de- 
sired, take the place of tests, but not of teacher's ranking, 
at the next promotion. This avoids the use of tests each 
year except for new pupils. 
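
Steps 5 through 9 can be sketched in a few lines of Python. This is only an illustration, not part of the original plan: the names Pupil, pedagogical_ages, promotion_record, and expected_promotion_age are hypothetical, the four pupils are those of the sample table, and the projection in step 9 assumes that the Promotion Quotient stays constant over the coming year.

```python
from dataclasses import dataclass


@dataclass
class Pupil:
    name: str
    chron_age: float   # chronological age in months
    ed_age: float      # educational age in months, from the objective test
    teacher_rank: int  # 1 = ranked highest by the teacher


def pedagogical_ages(pupils):
    """Step 5: the pupil ranked k-th by the teacher receives the k-th
    highest educational age earned by any pupil in the class."""
    ages_desc = sorted((p.ed_age for p in pupils), reverse=True)
    return {p.name: ages_desc[p.teacher_rank - 1] for p in pupils}


def promotion_record(pupil, ped_age, w_ed=1.0, w_ped=1.0):
    """Steps 6 and 7: weighted mean of educational and pedagogical age,
    then that mean as a percentage of chronological age."""
    promo_age = (w_ed * pupil.ed_age + w_ped * ped_age) / (w_ed + w_ped)
    promo_quotient = 100.0 * promo_age / pupil.chron_age
    return promo_age, promo_quotient


def expected_promotion_age(promo_quotient, chron_age, months_ahead=12):
    """Step 9, assuming the quotient is unchanged a year later."""
    return promo_quotient / 100.0 * (chron_age + months_ahead)


sample = [
    Pupil("r", 140, 144, 2),
    Pupil("a", 150, 138, 3),
    Pupil("h", 110, 133, 1),
    Pupil("b", 121, 130, 4),
]
ped = pedagogical_ages(sample)
for p in sample:
    age, quotient = promotion_record(p, ped[p.name])
    print(p.name, ped[p.name], age, round(quotient, 1))
```

The weights w_ed and w_ped stand for the "any desired weight" of step 6; equal weights reproduce the sample table.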

Teachers' judgments were not utilized in the reclassifica-
tion of School X because the above plan is a subsequent 
discovery. 

Objections to Classification by Tests.— By way of 
summary it may be well to list the more common objections 
to promotions on the basis of educational or intelligence 
tests. 

1. Young pupils are forced to compete with the mentally
more mature. This is a relic of the old notion that all pupils 
are born equal and that subsequent mental age keeps pace 
with chronological age. In general this objection represents 
a misplaced sympathy. Every investigation shows that it 
is a rule for the young pupils to be leading their classes and
for the older pupils to be struggling to keep up. Classifica-
tion by both educational age and E.Q. entirely meets this
objection. 

2. Young pupils have difficulty in making social adjust- 
ments. It would be truer to say that older pupils have diffi- 
culty in adjusting to the younger ones. There is an 
undoubted tendency for older pupils to dislike the presence 
of a much younger pupil, for his presence is a standing 
insult to their intelligence. How serious this jealousy is 
needs to be investigated. Classification by both educational 
age and E.Q. completely meets this objection. 

But what will the gifted pupils do when they reach the 
high school while still very young? One suggestion is that 
they delay their arrival at the high school by taking a 
wider educational swath. If, however, the curriculum has 
been properly constructed this means that the gifted pupil 
will be spending almost half his time upon material of rela- 
tively small value. The only satisfactory solution is to 
provide a path for the geniuses which leads from the first 
grade through the university, so that the genius pupil may 
be with his kind throughout his entire educational career. 
If he graduates from the university while still too young he 
may be employed in national research or on other large 
social enterprises until he is judged sufficiently mature 
physically to take his place in the general social group. 

The only visible solution for the small school is to pro- 
mote the young gifted pupil just as often as the older pupils 
will permit without making his life miserable. Another 
solution is to abolish the school which is too small to make 
adequate provision for individual differences among its 
pupils and to substitute the consolidated school in its place. 

3. Causes vital gaps in pupil's education. Classifica- 
tion by educational age meets this objection provided the 
testing has been thorough. To receive a high educational 
score shows that these gaps have somehow been filled. If 
the educational testing has been inadequate pedagogical age 
may be included. 

It is difficult to believe that this is the real objection. It
can be demonstrated that older pupils have phenomenally
large gaps in prerequisite abilities. But this does not seem 
to produce any particular concern. Worry comes only 
when the young pupil is involved. Educators find it almost 
as difficult as laymen to prevent themselves from thinking 
in terms of such irrelevant surface factors as chronological 
age, physical size, and brute muscles. We think of children 
as we would of elephants or dinosaurs. Considering how 
much of our lives are regulated by chronological age, this is 
not surprising. We are born at zero years of age, compelled 
to begin school at six, permitted to leave school at fourteen, 
allowed to marry at sixteen, entitled to vote at twenty-one, 
and are given an average salary at that age where a long 
life of usefulness is passing into decline. Squeezed in be- 
tween a chronological end and a chronological beginning the 
passage through school naturally becomes a chronological 
procession. 

4. Disregards health. Some picture the gifted child as a 
frail, forced, hot-house flower. Terman, after a careful 
study of many gifted pupils, concluded that they were no 
more frail than ordinary children. He found that some were
frail and some robust. Consequently if there is any reason 
to suppose that health will be sufficiently improved by giv- 
ing the pupil intellectually easy tasks, health should cer- 
tainly be considered. 

There is, however, a fear abroad that a pupil's mind may, 
like Jefferson's Constitution, be stretched until it cracks. 
In his Columbia University Master's thesis Franzen de- 
scribes a ten-year-old pupil in Grade V with an I.Q. of 178. 
This genius distinguished between poverty and misery, thus:
"Poverty is the lack of things we need, misery is the lack
of things we want." He defined a nerve as the "conduction
unit of sensation" and explained correctly what he meant
thereby. It was discovered that he had read all the text- 
books of the grades ahead of him. His two able parents 
after a consultation with the family physician refused per-
mission for him to be promoted from the grade where he
was bored almost to extinction because it might strain his
mind! 

5. Emphasizes the intellectual to the exclusion of char- 
acter traits. This is another way of saying that pupils are 
classified by their abilities and not by their purposes. It 
may be that a pupil can be taught desirable purposes in 
one grade as easily as in another. It is certain that pur- 
poses do not fall into such close hierarchies as do abilities. 
Furthermore, it is possible that most pupils who are pro- 
moted for their intellectual achievements would likewise be 
promoted for their composite character status. Terman [12]
studied the extent to which intellectually gifted pupils pos- 
sessed the following intellectual and personal traits: sense 
of humor, power to give sustained attention, persistence, 
initiative, accuracy, will power, conscientiousness, social 
adaptability, leadership, personal appearance, cheerfulness, 
cooperation, physical self-control, industry, courage, depend- 
ability, self-expression through speech, intellectual modesty, 
obedience, popularity among fellows, evenness of temper, 
emotional self-control, unselfishness, and speed. Any reader 
would not complain of any lack if he possessed intelligence 
plus this galaxy of traits. Terman found that all these traits 
correlated positively with intelligence, that is to say, with 
ability primarily. The first trait, sense of humor, has, in 
the case of gifted children, a correlation of .58. The last 
trait, speed, correlates .28. The others gradually vary be- 
tween these extremes in the order named. Terman claims 
that he can roughly predict I.Q. from an average of these 
24 traits. 

The most common explanation given by teachers for the 
failure of certain specially promoted pupils to do satisfac- 
tory work is that they do not try. It is generally admitted 
that they could do the work of the grade if they only would. 
These remarks by teachers suggest two questions. (1) Since
tests reveal that these pupils have actually mastered, some-
where, somehow, large segments of the curriculum, is it not
possible that they are mastering the material of the new
grade with such unobtrusive ease as to deceive even the
keenest observer? (2) If the teachers are correct may not
the pupil's lack of industry be due to improper habits formed
by previous improper classification where industry was not
required?

[12] L. M. Terman, The Intelligence of School Children, p. 58; Houghton Mifflin
Co., New York City, 1919.

Classification by both educational age and E.Q. or mental 
age and I.Q. will meet some of the above objections. The 
inclusion of pedagogical age will meet others, notably the 
last. Where the school is so small that it is impossible to 
classify by both educational age and E.Q. it is necessary to 
weigh such of the above objections as are relevant in the 
case of a particular pupil against the likelihood, if he is not 
promoted, that he will develop habits of inattention, intel- 
lectual laziness, and disorderly conduct. On all these ques- 
tions more research is sorely needed. Until fuller knowledge 
is available it behooves all users of tests for purposes of 
classification to exercise great care. 



CHAPTER III 

MEASUREMENT IN DIAGNOSIS 



I. Diagnosis of Initial Situation 

Plan of Discussion.— A principal or supervisor who 
assumes responsibility for a new school, or a teacher who 
takes charge of a new class is faced with the necessity of 
making two types of diagnoses: a general diagnosis of the 
initial condition and a more detailed diagnosis of the par- 
ticular defects of classes or pupils. 

In order to give concreteness to the discussion in this and 
the two succeeding chapters I shall keep in mind the 
teaching of reading. A reasonably competent student will 
be able to transfer the techniques described to other sub-
jects. Marginal numbering will be made continuous
through the three chapters in order to indicate the steps in
the total process of using tests in teaching.

TABLE 8

Shows the Information Needed to Guide Instruction in Reading

                                           Pupil
                                      A     B     C     D     E    Mean

 1. Chronological Age                121   123   124   118   100
 2. Initial Reading Score             40    43    30    31    38    36.4
 3. Initial Reading Age              121   130    93    96   116
 4. Initial Reading Quotient         100   106    75    81   116
 5. Initial Mental Age               121   150    83   100   111
 6. Intelligence Quotient            100   122    67    85   111    97
 7. Initial Accomplishment Quotient  100    87   112    96   105   100
 8. Estimated Final Reading Age      131   142   100   105   127
 9. Estimated Final Mental Age       131   162    90   109   122
10. Reading Score Objective           43    47    33    34    42    39.8
11. Final Reading Score               43    50    34    36    43    41.2
12. Final Reading Age                130   150   104   110   130
13. Final Accomplishment Quotient     99    93   116   101   107   103
14. Final A.Q. Minus Initial A.Q.     -1    +6    +4    +5    +2    +3

1. Determine Each Pupil's Chronological Age. —
Line 1 in Table 8 shows the chronological age in months of
five selected fourth-grade pupils. (These five pupils will be 
treated hereafter as though they constituted an entire 
fourth-grade class.) 

2. Determine the Initial Ability to Comprehend 
What Is Read Silently.— The Thorndike-McCall Read- 
ing Scale may be used to determine each pupil's initial 
ability to understand what he reads. An easy and difficult 
portion of this scale, together with the directions which 
accompany it, are reproduced below. 

Read this and then write the answers. Read it again if you need to. 

Nell's mother went to the store on Water Street to buy ten pounds of 
sugar, a dozen eggs and a bag of salt. She paid a dollar in all. Nell and 
Joe went with her. On the way home on Pine Street, they saw a fire-engine 
with three horses. 

1. Was the salt in a box or a bag or a can or a dish ? 

2. How many eggs did she buy ? 

3. What did the children see on Pine Street? 

4. What street was the store on ? 

Read this and then write the answers. Read it again if you need to. 

COLERIDGE 

I see thee pine like her in golden story 
Who, when the web — so frail, so transitory, 
The gates thrown open — saw the sunbeams play 
With only a web 'tween her and summer's glory; 
Who, when the web — so frail, so transitory, 
It broke before her breath — had fallen away, 
Saw other webs and others rise for aye. 
Which kept her prisoned till her hair was hoary.
Those songs half-sung that yet were all divine — 
That woke Romance, the queen, to reign afresh — 
Had been but preludes from that lyre of thine. 
Could thy rare spirit's wings have pierced the mesh 
Spun by the wizard who compels the flesh, 
But lets the poet see how heav'n can shine. 




30. Who acted like a spider ? 

31. Who or what is compared with the woman ? 

32. Copy the first word of the line which implies there had not been a 
continuous stream of like songs? 

33. Complete the following with one word only: 

"Those songs" really means those 

Directions for Using 
Thorndike-McCall Reading Scale 

Form 2 

HOW TO APPLY TEST 

To the Examiner: Distribute test booklets, with front 
page up. Request pupils not to open booklets until the 
signal is given. Have pupils fill out the blanks at top of the 
front page. Read the paragraph aloud while pupils read 
silently. Read the first question aloud. Have it answered 
orally and then in writing by the pupils. Stop pupils thirty 
minutes after saying Open paper! Begin! Give no further
help. 

HOW TO SCORE TEST 

It is suggested that the examiner answer the questions on 
the first test page, study the scoring key below, score the 
first test page for the entire class, and then repeat the 
process for the other test pages. The four questions on the 
front page should not be scored. The scoring key gives 
correct answers, in some instances followed by incorrect 
answers. The only portion of the answer required for cor- 
rectness is in italics. The scoring key is by no means 
exhaustive. Other answers are to be scored as correct or 
incorrect in accordance with the following suggestions: 

1. Any answer equal in merit to the worst answer called 
correct in the key is to be scored correct. 

2. Complete sentences are not required. 

3. Call wrong a correct answer plus an incorrect or irrele- 
vant answer. 




4. Call correct a correct answer plus a harmonious addi- 
tion. 

5. Call incorrect any misplaced, omitted, or illegible 
answers. 

6. Do not call an answer wrong because of alteration, use 
of abbreviation, or errors in capitalization, punctuation, 
spelling, or grammar unless any of these indicate actual 
misreading. 

7. In general the pupil's answer must show that he has 
correctly read both the paragraph and the question and 
is able to word a readable response to both. 

SCORING KEY 

The only portion of the answer required for correctness is in italics. 
* Incorrect answers.

1. The salt was in a bag. 

2. She bought a dozen eggs; 12. 

3. The children seen a fire-engine with three horses. 

4. Water Street; Walter Street. 

5. He killed the fox; Kill it. 

* Killed; Shot him. 

6. The rabbit was brown. 

7. In the woods; In the forest. 

* Woods. 

8. Beauty had three brothers; Three boys; 3. 

9. Yes; Younger than both; She was younger. 

10. No; No, Beauty; The youngest. 
* Beauty; Young.

11. Jealousy; Jealous. 

12. Music lessons; Piano lessons; Play piano. 

* Grace's music ; Singing ; Piano ; Music and hens. 

13. No; Grace sell them; Not. 

14. Yes; Less; He doesn't like it at all. 

* Fred does not like music. 

15. Richard; Richard and Edward. 

* Edward ; Richard and Henry. 

16. Sam and Henry; Sam and Edward. 

* Edward doesn't like Sam ; Sam, Henry, Edward. 

17. Arthur and Richard. 

18. Henry and Edward; Henry; Edward. 

19. One Hour; From ten minutes of seven to ten minutes of eight; He 
left the house at ten minutes of seven and reached the store at ten 
minutes of eight. 

* Almost an hour.

20. Once a week; One time; One day; Every Sunday; Only Sunday;
On Sunday; Sundays; At Sunday; 1 each week; 1 a week.

* Sunday ; One. 

21. He rose, dressed, ate breakfast, and left house. 

* Dressed and left house ; Got ready for work ; Rose, dressed, ate 
breakfast, and worked. 

22. No. 

23. Farmers too prosperous to trouble about it; Because the men are so 
rich. 

* Too rich to take the trouble; Because harvesters not careful; 
Farmers are getting prosperous. 

24. Out in the wheat fields of Kansas; In Pawnee County; On the fields 
where wheat was left. 

* In harvest field ; Gathering wheat ; Going over wheat fields ; In 
a wheat field in Pawnee County. 

25. After the planting. 

* After planting is done and weeds are killed ; When dry or weedy ; 
After the crop begins to grow. 

26. Plowing planting sowing weeding watering loosening (any two) ; 
Plows and plants. 

* Plowing hoeing ; Plowing spading ; He digs and plants ; Loosened 
and watered. 

27. Shyness and tendency to stammer; Stuttering and shyness; Retiring 
nature and speech; Retiring stammering; He stammered and shy. 

* His shy and retiring nature. 

28. On mathematics; Mathematics and preaching; Lectured on mathe- 
matics. 

* Lecturer on mathematics. 

29. He has written funny books and poems and made lantern pictures; 
Studying, writing books; Narrative books; He was a writer; Alice 
in Wonderland. 

* Wrote. 

30. The wizard; Disease; Illness. 

* The lizard ; Spun by the wizard. 

31. Coleridge; Thee; Poet; Life of the poet; One to whom the poem 
is addressed. 

* A man was compared ; The person whom the author is writing 
about; Coleridge's writings. 

32. That. 

33. Poems; Verses; Lines; Stanzas. 

* Pieces of poetry ; Lyrics. 

34. Methods operation; Method operation; Two methods. 

* Two methods, operation; An operation, methods; An operation;
Experiment operation.

35. Separation resolution.

* Separation dissemble. 

HOW TO COMPUTE PUPIL SCORE 

To compute a pupil's score, first total the number of
questions he answers correctly, look up this total in the first
column of Table 9, and read in the second column the cor-
responding T score. The T score is the pupil's score and
should be tabulated as in Column III of Table 12.

TABLE 9

Questions   T       Questions   T       Questions   T       Questions   T
Correct     Score   Correct     Score   Correct     Score   Correct     Score

   0        23         9        33        18        43        27        63
   1        25        10        34        19        45        28        67
   2        26        11        35        20        47        29        71
   3        26.5      12        36        21        49        30        76
   4        27        13        37        22        51        31        79
   5        28        14        38        23        53        32        82
   6        29        15        39        24        56        33        86
   7        31        16        40        25        58        34        92
   8        32        17        42        26        60        35        96


HOW TO COMPUTE CLASS SCORE 

The class or grade score is the arithmetic mean of the 
pupils' scores, as shown beneath Column III in Table 12. 

HOW TO COMPARE CLASS OR GRADE WITH GRADE NORM 

Find the appropriate grade norm in Table 10, and write it 
beneath the class or grade score as shown in Table 12. 
The norm for IV B is shown in Table 12 because the tabu- 
lation is for a IV B class. 
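
The class score and its comparison with the grade norm amount to a mean and a subtraction. The sketch below is illustrative only: the names GRADE_NORMS, class_score, and compare_with_norm are hypothetical, the norms are copied from Table 10, and the four T scores are those of the sample score sheet in Table 12.

```python
from statistics import mean

# Mean T-score norms at the end of each half-grade, copied from Table 10.
GRADE_NORMS = {
    "2A": 26.0, "2B": 30.0, "3A": 33.7, "3B": 37.3,
    "4A": 39.6, "4B": 41.8, "5A": 44.9, "5B": 48.0,
    "6A": 50.9, "6B": 53.7, "7A": 56.0, "7B": 58.3,
    "8A": 59.6, "8B": 60.9, "9A": 61.5, "9B": 62.1,
    "10A": 62.9, "10B": 63.6, "11A": 64.5, "11B": 65.4,
    "12A": 66.8, "12B": 68.1,
}


def class_score(t_scores):
    """The class or grade score is the arithmetic mean of the pupils' T scores."""
    return mean(t_scores)


def compare_with_norm(t_scores, grade):
    """Return (class score, grade norm, class score minus norm)."""
    score = class_score(t_scores)
    norm = GRADE_NORMS[grade]
    return score, norm, score - norm


# The IV B class of Table 12: a mean of 41.25 against a norm of 41.8.
score, norm, difference = compare_with_norm([27, 52, 40, 46], "4B")
print(score, norm, round(difference, 2))
```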



TABLE 10

Grade Norms 2A-7B

At the End of          2A    2B    3A    3B    4A    4B    5A    5B    6A    6B    7A    7B
Mean Norm              26    30    33.7  37.3  39.6  41.8  44.9  48.0  50.9  53.7  56.0  58.3
Approx. No. of Pupils
in Thousands           0.2   0.3   3     5     5     10    5     10    5     10    5     10

Grade Norms 8A-12B

At the End of          8A    8B    9A    9B    10A   10B   11A   11B   12A   12B   Superior
                                                                                   Teachers
Mean Norm              59.6  60.9  61.5  62.1  62.9  63.6  64.5  65.4  66.8  68.1  72
Approx. No. of Pupils
in Thousands           5     10    Est. 1      Est. 1      Est. 1      Est. 1      0.3






HOW TO COMPARE PUPIL WITH AGE NORM 

To compare each pupil with age norms the examiner
should look up the pupil's T score in the first column of
Table 11, find in the second column the corresponding Read-
ing Age, and divide the Reading Age so found by the pupil's
chronological age to find his Reading Quotient, as shown in
Columns I, IV, and V of Table 12. The Reading Quotient
is 100 for the child with normal reading ability, and pro-
portionately below or above 100 for the inferior or superior
reader respectively. All quotients are multiplied by 100 to
eliminate decimal points.



TABLE 11

Reading Age Norms

T       Reading     T       Reading     T       Reading     T       Reading
Score   Age         Score   Age         Score   Age         Score   Age

21       67         41      124         61      181         81      238
22       70         42      127         62      184         82      240
23       73         43      130         63      186         83      243
24       76         44      133         64      189         84      246
25       79         45      135         65      192         85      249
26       82         46      138         66      195         86      252
27       84         47      141         67      198         87      255
28       87         48      144         68      201         88      257
29       90         49      147         69      203         89      260
30       93         50      150         70      206         90      263
31       96         51      152         71      209         91      266
32       99         52      155         72      212         92      269
33      101         53      158         73      215         93      272
34      104         54      161         74      218         94      275
35      107         55      164         75      220         95      278
36      110         56      167         76      223         96      281
37      113         57      169         77      226         97      284
38      116         58      172         78      229         98      287
39      118         59      175         79      232         99      290
40      121         60      178         80      235        100      293
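
The look-up in Table 11 and the division by chronological age can be sketched as follows. The names READING_AGE, reading_age, and reading_quotient are hypothetical, only whole-number T scores from 21 to 100 are covered (a T score of 26.5 would need interpolation, which is not attempted here), and rounding to the nearest whole number is an assumption suggested by the printed quotients. The four pupils are those of the sample score sheet in Table 12.

```python
# Reading ages in months for T scores 21 through 100, copied from Table 11.
_READING_AGES = [
    67, 70, 73, 76, 79, 82, 84, 87, 90, 93,
    96, 99, 101, 104, 107, 110, 113, 116, 118, 121,
    124, 127, 130, 133, 135, 138, 141, 144, 147, 150,
    152, 155, 158, 161, 164, 167, 169, 172, 175, 178,
    181, 184, 186, 189, 192, 195, 198, 201, 203, 206,
    209, 212, 215, 218, 220, 223, 226, 229, 232, 235,
    238, 240, 243, 246, 249, 252, 255, 257, 260, 263,
    266, 269, 272, 275, 278, 281, 284, 287, 290, 293,
]
READING_AGE = dict(zip(range(21, 101), _READING_AGES))


def reading_age(t_score):
    """Reading age in months for a whole-number T score (21 to 100)."""
    return READING_AGE[t_score]


def reading_quotient(t_score, chron_age_months):
    """Reading age divided by chronological age, multiplied by 100."""
    return round(100 * reading_age(t_score) / chron_age_months)


# (chronological age in months, T score) for the four pupils of Table 12.
sheet = [(124, 27), (120, 52), (148, 40), (134, 46)]
quotients = [reading_quotient(t, age) for age, t in sheet]
print(quotients, sum(quotients) / len(quotients))
```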



TABLE 12

Sample Score Sheet
School No. II, Grade IV B. Teacher Miss X. Date June 15, 1920.

    I             II             III        IV         V
Chron. Age      Pupils'           T       Reading    Reading
 in Mos.        Names           Score       Age      Quotient

   124        Adams, Sam          27         84         68
   120        Baker, Mary         52        155        129
   148        Davis, Geo.         40        121         82
   134        Evans, Asa          46        138        103

                    Mean          41.3      Mean        95.5
              Grade Norm          41.8      Norm       100

HOW TO INTERPRET T SCORES

The unit proposed above for measuring reading ability
or any mental ability has been called T. The number of
questions correct is not a satisfactory unit for measuring
reading ability because the difference in difficulty between
33 and 34 questions correct may be greater or less than
between 12 and 13 questions correct. The difference in
difficulty between 33 T and 34 T always equals that between
12 T and 13 T.

Again, T scores make possible such statements as the 
following: Any pupil whose T score is 50 has an ability 
which equals the average ability of all twelve-year-old chil- 
dren in the nation. Any pupil whose T score is 70 has an 
ability which is 20 T (or 2 sigmas) above the average ability 
of twelve-year-olds. Any pupil whose T score is 35 is 15 T 
(or 1.5 sigmas) below the average ability of twelve-year-
olds.

Again, T scores may be interpreted thus:

A T Score of   Is Exceeded by the Following     A T Score of   Is Exceeded by the Following
               Per Cent of 12-year-olds                        Per Cent of 12-year-olds

    25                  99                          55                  31
    30                  98                          60                  16
    35                  93                          65                   7
    40                  84                          70                   2
    45                  69                          75                   1
    50                  50                          80                   0.1
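
The percentages above agree closely with a normal curve having a mean of 50 T and a sigma of 10 T. The sketch below rests on that assumption, which the text states only in part (it gives 20 T as 2 sigmas but does not name the curve); the function name percent_exceeding is hypothetical.

```python
import math


def percent_exceeding(t_score, mean=50.0, sigma=10.0):
    """Percentage of twelve-year-olds whose ability exceeds the given
    T score, assuming a normal distribution of ability (an assumption)."""
    z = (t_score - mean) / sigma
    # Upper-tail area of the normal curve, expressed as a percentage.
    return 100.0 * 0.5 * math.erfc(z / math.sqrt(2.0))


for t in range(25, 85, 5):
    print(t, round(percent_exceeding(t), 1))
```

Rounded to the nearest whole per cent, the computed values for T scores of 40, 50, 60, and 70 are 84, 50, 16, and 2, as in the table.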






HOW TO INTERPRET READING QUOTIENTS 

How to interpret a Reading Quotient for a pupil or the 
mean Reading Quotient for a class is shown in the third 
column below. The second column gives the percentage of 
5,000 pupils in grades II through VIII who had the Reading 
Quotients indicated in the first column. 



Reading        Per Cent
Quotient       of Pupils     Interpretation

Below 55          0.1
55 to 65          2.4        Exceptionally inferior
65 to 75          8.0        Very inferior
75 to 85         13.5        Inferior
85 to 95         18.0        Low average
95 to 105        19.6        Average
105 to 115       17.0        High average
115 to 125       10.5        Superior
125 to 135        5.8        Very superior
135 to 145        3.5        Exceptionally superior
145 up            1.5





There are at least two useful ways of expressing the read- 
ing ability of a pupil (or of a class). How well a pupil 
reads is shown first by a comparison of his reading score 
with the norm for his grade. The use of this method alone 
encourages the school to retard pupils chronologically either 
unconsciously or consciously in order to give an appearance 
of high efficiency. Given a sufficient percentage of over- 
ageness or under-ageness, almost any school can appear 
efficient or inefficient respectively when compared with grade 
standards. 

How well a pupil reads is shown, second, by a comparison 
of his reading score with the norm for his age. This is just
what is expressed by the Reading Quotient. Showing as it 
does what the school has accomplished for the pupil by a 
given age rather than a given grade, the Reading Quotient 
has very great value, because the school cannot raise the 
Reading Quotient above 100 nor depress it below 100 by 
creating undue chronological retardation or acceleration 
respectively. 






SUGGESTIONS FOR ECONOMIZING TIME 

When the examiner has some skill in statistical procedure 
and is interested in class or grade or age scores only, it is 
not necessary to convert each pupil's number-of-questions- 
correct into a T score. Instead it will be a great saving of 
time to make a frequency distribution showing the number 
of pupils making each number-of-questions-correct. The 
first column of the frequency distribution can then be con- 
verted into T scores and the mean computed. 

Again, when the examiner is interested only in reading 
ages and Reading Quotients time can be saved by copying 
column I of Table 9 into Table 11. This will save one step
in the process of converting the number-of-questions-correct 
into a reading age. 
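
The first suggestion, a frequency distribution of number-of-questions-correct converted once into T scores, might be sketched as follows. The names T_SCORE and mean_t_from_raw are hypothetical, the conversion values are copied from Table 9 on the reading that its first block runs from 0 through 8 questions correct, and the four raw scores in the example are invented for illustration.

```python
from collections import Counter

# T score for 0 through 35 questions correct, copied from Table 9.
_T_VALUES = [
    23, 25, 26, 26.5, 27, 28, 29, 31, 32,
    33, 34, 35, 36, 37, 38, 39, 40, 42,
    43, 45, 47, 49, 51, 53, 56, 58, 60,
    63, 67, 71, 76, 79, 82, 86, 92, 96,
]
T_SCORE = dict(enumerate(_T_VALUES))


def mean_t_from_raw(raw_scores):
    """Tally pupils by number of questions correct, convert each distinct
    number once, and return the frequency-weighted mean T score."""
    frequency = Counter(raw_scores)
    pupils = sum(frequency.values())
    return sum(T_SCORE[raw] * n for raw, n in frequency.items()) / pupils


# Hypothetical class: two pupils with 16 correct, one with 18, one with 20.
print(mean_t_from_raw([16, 16, 18, 20]))
```

Each distinct number-of-questions-correct is converted only once, however many pupils made it, which is the saving the suggestion has in view.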

Reading Achievement of Various Cities

At the End of               III    IV     V      VI     VII    VIII   IX     X      XI     XII

10 Miscellaneous Cities     32.8   40.9   46.3   51.7   58.0   59.8   60.7   62.5   64.3   67.0
33 Wisconsin Cities         3?.2   40.9   47.2   52.6   55.3   58.0
18 Indiana Cities           40.0   49.9   58.9   67.0   68.8   71.5
Louisville, Ky.                    39.1   43.6   51.7   59.8   60.7
New York City, N. Y.        36.5   41.0   47.5   51.5   55.8   58.4
Paterson, N. J.                    35.5   40.9   49.0   51.7   53.5
San Francisco, Cal.         37.3   45.4   53.5   52.6   58.9   62.5
St. Paul, Minn.             39.1   41.8   46.3   53.5   58.0   62.5



ACCURACY OF NORMS 

It is impossible to determine exactly how many pupils are 
included in the grade norms which accompany this test. 
An extremely conservative estimate is attempted in Table 
10. The age norms given in Table 11 were read from a 
smoothed curve based upon the following number of pupils: 



Age       7-8   8-9   9-10   10-11   11-12   12-13   13-14   14-15   15-16   16-17   Adult
Pupils    128   635   1275   1462    1561    1833    1662    1112    430     65      300



3. Convert the Initial Reading Score into a Read-
ing Age. — The foregoing Directions for Using Thorndike-
McCall Reading Scale shows how to convert reading scores
into reading ages. The reading ages for our five pupils are
shown in line 3 of Table 8.

The determination of reading age on the Thorndike- 
McCall Reading Scale is a simple matter because this test 
has age norms. In case the teacher is using some reading 
or other test which has grade norms and not age norms, the 
grade norms may be transmuted into age norms by means 
of a technique described in the preceding chapter. 

One important function of this initial inventory of read- 
ing is to prevent re-teaching of abilities which have already 
been taught. Ayres and others have estimated the tremen-
dous financial cost to the public of the 33 per cent of
retardation in the schools. Someone has computed that 
$40,000,000 are spent annually re-teaching pupils. No one 
has been willing to estimate the loss to the retarded pupils. 

Unfortunately the real retardation and its cost have been 
little studied. Retardation studies have called pupils re- 
tarded who were not retarded and overlooked retardation 
which was really present. This is still another fallacy which 
has resulted from a superficial study of surface appearances 
and one more argument for the use of educational tests to 
increase visibility for the really significant factors. Most
pupils who are chronologically retarded are not education- 
ally retarded at all. The only true cases of retardation are 
pupils who are kept below the grade for which they are 
fitted by educational age. Most of the chronologically 
retarded are where they belong educationally or, to be more 
exact, they are usually a little accelerated. It is the chrono- 
logical accelerates who are usually most retarded educa- 
tionally. Thus educational measurement justifies the rather 
queer conclusion that chronological retardation tends to 
mean educational acceleration. Contrary to usual thinking, 
the chief cost of re-teaching occurs with the latter rather 
than the former group of pupils. It is the chronological 
accelerates who are educationally retarded and who are re- 
taught something they already know. The chronologically 
retarded are, on the whole, re-taught something which they
failed to learn from one teaching. Of course, re-teaching
these mentally inferior pupils is costly, but in the long run 
it is probably less expensive than to permit them to proceed 
without adequately mastering the prerequisites. The func-
tion, then, of the initial inventory is to prevent the cost to
pupil and public of re-teaching what has really been learned.

A second function of the initial inventory is to avoid
premature teaching. We have already seen how pupils are 
frequently started on a phase of the curriculum which, in 
the light of their measured capacity to learn, is too difficult 
for them. We saw again that pupils are frequently re- 
quired to learn a portion of the curriculum before they have 
learned certain prerequisites in the hierarchy. The initial 
inventory will not only prevent such premature teaching in 
general, but will definitely point out for the guidance of 
both learner and teacher just where the pupil is most de- 
ficient and hence where he most needs help. Two of the 
great wastes in education are due to re-teaching or prema- 
ture teaching. An adequate initial inventory will prevent 
both. Says Foote, "When pupils and teachers know where 
they are and where they are to go there is reason to believe 
that the journey will be accomplished; otherwise it is very 
doubtful." It is the function of the initial inventory to show 
pupils and teachers where they are. 

4. Determine the Reading Quotient. — The previ-
ously quoted Directions for Using Thorndike-McCall Read-
ing Scale describes how to compute and interpret Reading 
Quotients. A similar procedure and interpretation applies 
to Arithmetic Quotients, Spelling Quotients and the like. 

5. Determine the Initial Mental Age. — It is recom-
mended that the teacher use, for determining the mental
age of each pupil, the Stanford Revision of the Binet-
Simon Intelligence Scale sold by Houghton Mifflin Com-
pany, New York City, or in case the school is in a decidedly
foreign neighborhood, the Pintner-Paterson Scale of Per-
formance Tests sold by Warwick & York, Baltimore, or
the National Intelligence Test sold by World Book Com-
pany, Yonkers-on-Hudson, N. Y. The first two tests
require fairly expensive equipment and a trained examiner.
The National Intelligence Test is relatively inexpensive,
can be given to all the pupils in a class or school at one 
time, and yields fairly satisfactory results even when used 
by a complete novice. Most teachers should use the last 
test. For grades below the third the teacher may use either 
Haggerty's Intelligence Test, Delta I, World Book Com- 
pany, Yonkers-on-Hudson, N. Y., or Pressey's Primer Scale, 
University of Indiana, Bloomington, Indiana, or Dear- 
born's Intelligence Test for these grades, Lippincott Com- 
pany, Philadelphia, Pa. 

The method of scoring the first two tests yields a mental 
age score directly. The score on the National Intelligence 
Test must be converted into a mental age just as the reading 
scores were converted into reading ages. Age standards are 
sent with this test. 

The initial mental age of pupil A is 121. It, together 
with the mental ages for the other 4 pupils, is shown in line 
5 of Table 8. 

6. Determine the Intelligence Quotient. — A pupil's 
Intelligence Quotient (I.Q.) is the quotient of his mental
age divided by his chronological age, multiplied by 100.
Thus pupil A's I.Q. is 121 divided by 121, times 100, i.e.,
100. The I.Q. of pupil B is 150 divided by 123, times 100,
i.e., 122. The I.Q. for each of the other three pupils is shown
in line 6 of Table 8.
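
The computation just described can be sketched in a few lines of Python. This is only an illustration and not part of the original directions: the function name `intelligence_quotient` is hypothetical, ages are assumed to be expressed in months, and the result is assumed to be rounded to the nearest whole number, which agrees with the two worked figures for pupils A and B.

```python
def intelligence_quotient(mental_age_months, chronological_age_months):
    """Mental age divided by chronological age, times 100, rounded."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * mental_age_months / chronological_age_months)


# The two pupils whose ages are given in the text.
print(intelligence_quotient(121, 121))  # pupil A
print(intelligence_quotient(150, 123))  # pupil B
```

The same function would serve for any quotient formed by dividing one age score by another, provided both are in the same unit.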

A knowledge of a pupil's I.Q. should be of very great 
value to any teacher of any subject, for the size of a pupil's 
I.Q. is an index of his general mental brightness or mental 
alertness. As Terman points out, the most important fact
about a pupil, next to character, is his I.Q. The significance
of I.Q.'s of varying sizes is brought out below:

Above 140    Genius or near genius.
120-140      Very superior intelligence.
110-120      Superior intelligence.
90-110       Normal or average intelligence.
80-90        Dullness.
70-80        Borderline deficiency, sometimes feeblemindedness.
Below 70     Definitely feebleminded.
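
As a rough sketch, the table can be turned into a lookup. The band limits are those of the table; the function name `classify_iq` is hypothetical. Because adjacent bands share their end points, the sketch makes an assumption the table does not settle: a score falling exactly on a shared limit (120, say) is placed in the higher band, and only scores strictly above 140 count as "Genius or near genius."

```python
# Lower limit of each band, highest first. Assumption: a score exactly on a
# shared limit is assigned to the higher of the two bands.
BANDS = [
    (120, "Very superior intelligence"),
    (110, "Superior intelligence"),
    (90, "Normal or average intelligence"),
    (80, "Dullness"),
    (70, "Borderline deficiency, sometimes feeblemindedness"),
]


def classify_iq(iq):
    """Return the descriptive label for an I.Q. according to the table."""
    if iq > 140:
        return "Genius or near genius"
    for lower_limit, label in BANDS:
        if iq >= lower_limit:
            return label
    return "Definitely feebleminded"


print(classify_iq(100))
print(classify_iq(122))
```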




These determinations of reading age and Reading Quo- 
tient, and mental age and Intelligence Quotient not only 
furnish valuable teaching guides but also provide the basis 
for educational guidance through a knowledge of a pupil's 
capacity to profit by general education and pursue particular
subjects.

General Capacity to Learn. — One problem in education 
is to locate the educational objectives. Another is to locate 
somebody who has the capacity to attain these objectives 
— to find somebody who is educable. Pigs, sheep, cows, 
horses, dogs, and other domesticated animals have widely 
varying capacities to learn. While the percentage of illit- 
eracy is high, these animals have a more or less definite 
curriculum and are taught certain lessons by their owners. 
It is only human beings, however, who are considered to 
have enough capacity to learn to make systematic, pro- 
longed education profitable. 

But the technique of diagnosing capacity to learn does 
not end with the classification of an animal as a human 
animal. The range of capacity to learn among humans is 
greater than the difference between humans in general and 
dogs in general. The overlapping of capacity is so great 
that a considerable per cent of humans have a capacity 
to learn which is inferior to that of the geniuses among dogs, cats,
monkeys, and other much reviled creatures.

In brief the procedure in diagnosing learning capacity is 
to measure what the individual can learn or has learned. 
The former method is to place a child, say, in a novel situa- 
tion and score how well he learns. The latter method is 
to assume that he has since birth been in a learning situa- 
tion and to measure how much he has learned. 

The first measurements of capacity to learn were simple
unstandardized observations of children by parents and
neighbors. These measurements were inevitably inaccurate 
because of numerous constant errors such as parental 
vanity, neighborly jealousies, absence of constant or fair 
standards of estimate as well as other more subtle errors. 




These subjective measurements were probably more accu- 
rate a generation ago than at present, because numerous 
progeny facilitated the development of surer standards of 
measurement. 

As a result of parental measurements the extremely stupid 
children were kept at home while those with a greater 
capacity were sent to school. Many of the stupidest ones 
were committed to institutions for the feebleminded, while 
the stupider ones were sent to special schools, classes, tutors, 
and the like. 

When children are dumped into the hopper of the educa- 
tional mill, they enter a great and more accurate selective 
machine. Every stage of education from the kindergarten 
through the university is engaged in the process of selection. 
In a very real sense our schools are as much selective as 
educative agencies. Every teacher takes her toll, though 
gallantry forbids saying that this is a case of the devil takes 
the hindmost. There is a miller whose water mill on the 
Tennessee River grinds the corn for the farmers for miles 
around. The miller always takes his toll from the best bag 
of corn. Teachers are more generous; they take toll from 
the children whose capacity to learn is least. Each year 
the ranks of this grand army of children grow thinner and
thinner. The Ph.D. or the equivalent is the educator's 
reward for the students who have been able enough or clever 
enough to escape the clutch of all the teachers. 

More and more educational selection is becoming an im- 
portant function of the school. Children are being com- 
mitted to institutions for the feebleminded. This is 
frequently construed as a stigma upon both children and 
parents. Private schools deny entrance to children whose
learning capacity is judged to be below a certain standard. 
Public schools are sending pupils to special classes for the 
mentally slow. Stupid pupils are denied promotion. Cer- 
tain public schools group pupils within each grade accord- 
ing to learning capacity. Other public schools refuse 
admission to any whose learning capacity is not unusually
great. Some countries, recognizing that their greatest asset
is their children of genius and that these geniuses belong to
the community rather than to particular parents, are select- 
ing these children for a special education. 

When matters of such critical importance to the individual 
are at stake a democracy will not long tolerate a system of 
educational selection which does not utilize the most thor- 
oughly scientific, impartial, impersonal, and rigidly stand- 
ardized technique possible. Standardized educational and 
psychological tests, inaccurate though they may be, are 
rapidly becoming recognized as the best means for educa- 
tional selection. It is but a question of time until they 
supplant the traditional selective mechanism of home and 
school. 

The use of tests for selective purposes unquestionably 
demonstrated their worth during the recent war. Influenced 
by this demonstration, Columbia College adopted the 
Thorndike College Entrance Intelligence Tests as a part 
of the machinery for selecting its students. Other colleges 
and many lower schools admit students on the basis of 
standardized measurements, while commitment of chil- 
dren to institutions for the feebleminded has long been a 
function of the Binet-Simon Intelligence Test. 

Lack of suitable group tests has delayed the movement
toward the adoption of standard tests as a means for deter- 
mining learning capacity. This lack is now being rapidly 
supplied. With the completion of the National Intelligence 
Tests, we have an instrument whose use should be universal 
for numerous purposes, particularly for discovering in- 
stances of unusual capacity. The last chapter in this book 
lists many other tests which will be found helpful. 

Psychologists are now able to tell with considerable 
accuracy whether a child possesses an I.Q. which will ever 
make it possible for him to do the work of a particular 
school or institution or grade in a school. Further, they 
are able to determine whether a child's mental age is now 
sufficient to learn the work of a particular grade. Terman's
experience leads him to the conclusion that the 60 I.Q.
pupil will not be able to do work beyond Grade III or IV. 
The 70 I.Q. child will not be able to do work beyond Grade 
V or VI. The 80 I.Q. will reach his limit about Grade VII. 
The 90 I.Q. pupil may by dint of much persistence go 
through high school. E.Q.'s of 60, 70, 80 and 90 for pupils 
whose educational opportunities have been normal may be 
interpreted like similar I.Q.'s. Even the attainment listed 
above cannot be reached until the mental age or educational 
age has sufficiently developed and this means considerable 
chronological retardation. 

Capacity to Learn a Particular Subject. — Mary inquires 
of her teacher whether it is wise, considering the strength 
of her purpose and the extent of her capacity, for her to 
study high-school mathematics. John walks into the prin-
cipal's office and innocently asks such a simple question as:
"Do you consider it best for me to study a foreign lan- 
guage?" Similar questions may be, though they less fre- 
quently are, asked about elementary school abilities. The 
determination of the capacity to learn in a particular subject 
is more difficult than to determine capacity in general. 
Three methods have been employed: 

1. A measurement of the pupil's previous achievement
and rate of progress in a subject which continues into the 
future. This method measures capacity inclusive of pur- 
pose. Reading, writing, arithmetic, spelling, composition, 
etc., are subjects which are more or less continuous through 
the elementary school. Ability in these subjects can be 
measured by group tests from about Grade III through 
Grade VIII. Hence within these ranges pupils may be 
advised for the future on the basis of their Reading Quo- 
tients, Spelling Quotients, etc., in these same subjects in the 
past. In the high school the student may be assigned tenta- 
tively to a certain study and asked to pursue it for a brief 
period, after which he is measured. His prospects may be 
estimated from his recent progress. 

2. A measurement of the mental abilities upon which
success in the subject to be pursued depends. This method
includes purposes only indirectly. Success in high school 
mathematics, for example, depends upon the possession of 
certain special mental abilities. Dr. Rogers ^ attempted to 
analyze the mental elements in mathematical ability and to 
construct tests for their measurement. If the analysis is 
correct, and if the tests really measure the subordinate abili-
ties discovered, and if a pupil is shown by the tests to possess
the requisite abilities, then we can make an accurate prog-
nosis concerning his prospects in, say, geometry.

3. A measurement of general intelligence. This method 
of making a prognostic diagnosis makes no pretense to a 
precise analysis of the abilities required. It is known that 
scholastic achievement, whether in the elementary school, 
high school, or college, is closely correlated with general 
mental ability. Hence a knowledge of the pupil's intelli- 
gence enables one to make a fairly accurate prognosis. 

The whether-to-pursue-a-subject question is intimately 
allied with the when-to-begin-a-subject question. The for- 
mer must consider whether the value, purpose, and capacity 
are now or ever will be adequate. The latter assumes that 
purpose and capacity will be adequate when experience and 
training have accumulated and maturity is far enough ad- 
vanced. The answer to both questions requires a knowledge 
of the limit of purpose plus capacity which must be exceeded 
to bring success. 

Our ignorance of this limit is not absolute. Educators 
and psychologists are now able to set crude limits, but they 
are very crude. A short time ago a teacher who is respected 
and admired for her knowledge of children and her skill in 
teaching them confessed that she did not know, for example, 
such a simple fact as the time when a child has both a pur- 
pose and capacity for reading. She proposed to set about 
determining this time. 

^ Agnes L. Rogers, Experimental Tests of Mathematical Ability and Their Prog-
nostic Value; Bureau of Publication, Teachers College, Columbia University,
N. Y. C., 1918.

But the definition of these limits is not as simple a prob-
lem as she conceived it to be. It is not simple because
ability to read does not suddenly and all at once leap into
existence. It gradually grows. By dint of enough effort
reading could be taught much earlier than it is. Further- 
more, ability for other activities is developing at the same 
time, so it becomes necessary to decide which of many abili- 
ties reach white heat first. Again, individuals differ in their 
rate of development. Finally, the whole question is com- 
plicated by the number of years of training required to reach 
the desired goal, and the age at which the pupil's daily needs 
require the attainment of a given goal. 

Unfortunately lack of knowledge of this whole question 
has resulted not so much in a strenuous desire for scientific 
study as in a sort of false sentimentalism. Many pupils 
are working far below their possible efficiency because of 
a fear that their neurones will be injured — that their minds 
will be strained! If minds were so easily strained, infants 
would be mentally stunted for life. The neurones are never 
again subjected to such a strain as when both parents and 
relatives are trying to wrest from the neural mechanism a 
faint approximation to "ma-ma," "pa-pa." Until he is
entirely ready to gratify these parents, uncles, and aunts, 
the baby sweetly and tolerantly smiles at their silly behavior, 
and continues to protect his tender neurones. Nature has 
the same excellent device for protecting pupils. When the 
pupil is presented with a problem beyond his capacity 
Nature automatically cuts the circuit and nothing happens. 

7. Determine the Initial Accomplishment Quotient.
— The customary procedure of comparing the class score 
with the grade norm is not as useful for measuring past 
efficiency as an Accomplishment Quotient, because the cus- 
tomary procedure totally disregards the intellectual calibre 
of the class. The Accomplishment Quotient is found by
dividing reading age by mental age and multiplying by 100.
Pupil A's reading age in Table 8 is 121 and his mental age
is 121. Hence his Accomplishment Quotient for comprehension
while reading silently is 121 divided by 121, times 100, i.e.,
100. The Accomplishment Quotients for the other pupils are
shown in line 7 of Table 8.
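
The arithmetic may be sketched as follows. This is an illustration only: the function name `accomplishment_quotient` is hypothetical, ages are assumed to be in months, and the quotient is assumed to be multiplied by 100 and rounded to a whole number. Apart from pupil A, the pupils in the small class list are invented for the example and are not the figures of Table 8.

```python
from statistics import mean


def accomplishment_quotient(reading_age_months, mental_age_months):
    """Reading age divided by mental age, times 100, rounded."""
    if mental_age_months <= 0:
        raise ValueError("mental age must be positive")
    return round(100 * reading_age_months / mental_age_months)


# Pupil A, as given in the text: reading age 121, mental age 121.
print(accomplishment_quotient(121, 121))

# Hypothetical class of (reading age, mental age) pairs; the second and
# third pupils are illustrative values, not data from Table 8.
hypothetical_class = [(121, 121), (110, 125), (100, 95)]
quotients = [accomplishment_quotient(r, m) for r, m in hypothetical_class]
class_mean = mean(quotients)
print(quotients, class_mean)
```

A class mean near 100 would suggest, on this measure, that the class as a whole has progressed about as far as its mental ages allow, whatever its standing against the grade norm.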

The Accomplishment Quotient is the most exact present- 
day measure of the efficiency of study, instruction, and 
supervision; it is the only just basis for reporting to parents
and for judging pupils; and it is the best index of what 
pupils need special attention and spurring, of what pupils 
need restraining perhaps, and of what pupils need to be 
"let alone." 

It is a common occurrence for pupils of low general intel- 
ligence to be placed in their chronological group and told 
to keep up with the class or be publicly stigmatized, officially 
denied promotion, and corporally punished at home. The 
Accomplishment Quotient holds out a promise of much- 
needed justice to such pupils. It asks the pupil to progress 
at a rate which is proportional to the mental capacity with 
which nature endowed him. 

It is a common occurrence for a pupil to be urged for- 
ward when for the sake of his health, possibly, he should 
be restrained, and for another pupil to be restrained who 
should be prodded. Pupil C is at the bottom of the class. 
In the conventional school he would be judged highly inef- 
ficient and all the weight of the school would be brought 
to bear upon him. Pupil B on the other hand would be 
praised for his excellent work, and would be given high 
marks. As a matter of fact pupil C has made a better use 
of his capacities than any other pupil in the class and pupil 
B is about the most inefficient. Because of improper instruc- 
tion pupil B has not "kept the traces taut."

It is a common occurrence for teachers to be assigned 
to a class of stupid pupils and to be expected to produce 
standard results. Such a policy is very unfair. The Accom- 
plishment Quotient promises protection against such injus- 
tice. 

Sometimes a teacher is assigned to a group of pupils of 
normal intelligence which has been improperly taught for
several years. It is unfair to expect a teacher to overcome
in one year several years of neglect. The mean Accomplish- 
ment Quotient for the pupils in Table 8 shows that this 
class has made the progress in the past of which it was 
capable and thus shows both the teacher and her supervisor 
that there is no initial handicap. 

That pupil or class which has an Accomplishment Quo- 
tient of 100 has made satisfactory progress. Consistent with 
health and the need for developing other abilities, the teacher 
should aim to keep the Accomplishment Quotient for read- 
ing as much above 100 as possible. As a rule the teacher's 
first task will be to assist the young gifted children. Due 
to their being classified below their proper level and to 
neglect in general their Accomplishment Quotients are
usually lower than those of any other pupils.

In the present stage of mental measurement even the 
Accomplishment Quotient is crude. It is crude because 
both the numerator and denominator of the formula are 
imperfect. There is an immense segment of abilities and 
purposes which should be in an ideal curriculum and which 
are not included in educational age, as at present deter- 
mined. Educational age, as determined at present, is satis- 
factory only to the extent that it is representative of the 
total abilities and purposes which should be the objectives 
of instruction. 

Similarly, the denominator is inadequate. There are 
native equipments other than those usually measured by 
intelligence tests which condition school progress. Teachers 
realize that pupils are not all intelligence or stupidity. 
Some are intelligent and lazy while some are stupid and 
industrious. To the extent that industry, persistence, con- 
scientiousness, etc., are results of instruction they belong 
in the numerator while to the extent they are native equip- 
ment they belong in the denominator. Thus much elaborate 
research is necessary before a thoroughly satisfactory 
Efficiency Quotient can be computed. 




II. Method of Diagnosis 

Method and Function of Diagnosis. — The tree Ig- 
drasil pictures the problem of diagnosis. Carlyle describes 
Igdrasil as the ash-tree of existence which has its roots 
deep-down into the kingdom of Hela, whose trunk reaches
heaven-high, and whose boughs spread over the whole uni- 
verse, a tree which is the past, present, and future, and 
what was done, is doing, and will be done. A central ability 
or purpose in a pupil is a miniature Igdrasil. Its roots reach 
deep-down into the educational conditions of early days, 
and its boughs spread through all his mental life; it shows 
the past, the present, and the future, and what was done, 
is doing, and will be done. 

One bough of reading ability reaches into reasoning prob- 
lems in arithmetic. The initial inventory reveals that the 
ability to solve written problems is defective. It then be- 
comes the business of diagnosis to locate the cause, and 
the cause of the cause, and the cause of the cause of the 
cause, and so on back to the teaching unit. In sum it be- 
comes the task of diagnosis to trace a miniature Igdrasil 
from leaf to root. In the illustration it is the task of diag- 
nosis to discover that the cause of inability to solve problems 
is a defective reading ability, and that the cause of a defec- 
tive reading ability is an inadequate vocabulary and so on. 
Thus the method of diagnosis is to trace abilities to their 
roots by means of standardized tests in order to discover 
just which ability or element of it exists out of standard 
proportions. This is the method of locating the underlying 
causes of defects. 

The function of diagnosis is to guide corrective measures.
There is an inscription upon the monument which com- 
memorates the arrival of the first white man at the Cumber- 
land River in Central Tennessee. The inscription is to his 
wife and reads thus: "She shed a leading light along his
path of destiny." Diagnosis is the veritable wife of
remedial instruction. Without its guidance corrective in-
struction is absolutely "hit or miss," with but one chance
to hit and several million chances to miss. 

There are an enormous number of diagnoses being made 
in our schools daily. Some of these diagnostic measurements 
are vague and penumbral and some are quite exact. Every 
increase in the accuracy of the diagnostic measurements 
means an increase in the percentage of hits. To make these 
diagnoses accurate requires time, but so does teaching. 
Many teachers do not realize that a large per cent of their 
pupils have not advanced one iota as a result of a year's 
teaching in, say, fundamentals of arithmetic. Diagnosis 
would mean a net saving of time. 

Diagnostic Methods: Introspection by Pupil. — This 
method is so obvious and is so frequently employed that it 
needs neither discussion nor illustration. Pupils frequently 
know not only the exact location of their difficulty but the 
cause of the difficulty as well. When the pupil is able to 
diagnose his own difficulty it is a waste of time and effort 
for the teacher to resort to the more elaborate methods yet 
to be described. Even when the pupil does not thoroughly 
understand his difficulty a conversation with him may give 
the more experienced teacher sufficient data to make a 
diagnosis. 

Diagnostic Methods: Observation of Normal Work. 
— The commonest method of diagnosis is to get some hint 
from the behavior of the subject being diagnosed. When 
school was not in session we three brothers worked in the 
mines with our father. He was particularly expert in diag- 
nosing the condition of the rock under which we worked 
and in detecting the imminence of danger. For this reason 
he was always assigned to the dangerous task of removing 
the last coal which supported the overhanging rock. As 
more and more of the coal was removed the weight of 
the millions of tons of rock slowly settled upon the frail 
wooden timbers. They would become taut like the strings 
of a violin, so that flying splinters caused by the pressure 
made a sort of music. Occasionally a timber would break
with a sharp sound like the crack of a rifle. Through it all
father worked as though unhearing. Perhaps a week later 
he would say: "Get your tools, boys, and get out as fast
as you can." We would go a short distance to a place of 
safety, lie down behind a car so as not to be struck by loose 
objects blown by the wind of the fall, and listen to the 
snapping of the props and the grinding of the mountain. 
As we grew older we, too, learned to interpret hints given 
by the rock. Here as with wild things in the woods it was 
diagnosis or death and diagnosis from subtle behavior 
hints. 

The teacher watches a pupil read who is having difficulty 
with reading. She observes that his eyes do not have three 
or four evenly-spaced brief fixations per line, but move for- 
ward, then jump back again, and act in a generally irregular 
fashion. Observation of this behavior aids the trained 
teacher to make a diagnosis of the difficulty. Another pupil 
is rarely able to complete an assignment in history. By 
observing his study the teacher notes that while reading his 
lesson he screws up his face, shakes his head, moves his lips, 
and tugs at his hair. This, too, is a hint to the perceiving 
teacher. Another pupil is very slow at figures. The dis- 
cerning teacher may construct a trial diagnosis by noting 
that he is counting with his fingers, toes, or tongue, and 
whispering as he adds: "Seven and six make thirteen, and
thirteen and eight make twenty-one." Another pupil is 
having trouble with division of fractions. An examination 
of his written work may reveal that the source of his diffi- 
culty is failure to invert the divisor. Thus accurate, de- 
tailed, trained, and experienced observation of pupils in the 
process of normal work is one method of discovering the 
data upon which to base a diagnosis and prescribe corrective 
measures. 

Courtis ^ has listed some arithmetical defects discovered
by this diagnostic method. Along with the defects he gives
an excellent statement of the underlying causes and suggests
corrections.

^ S. A. Courtis, Teacher's Manual for the Standard Practice Tests; World Book
Co., Yonkers-on-Hudson, N. Y., copyrighted 1915. Used by permission of the
publishers.

"1. Child's movements very slow and deliberate, but
steady.

"2. Child's movements rapid but variable. Adding
accompanied by general restlessness, sighs, frowns, and
other symptoms of nervous strain.

"3. Child's progress up the column irregular; rapid
advance at times with hesitation, or waits, at regular or
irregular intervals. Often gives up and commences a
column again.

"4. Child stops to count on fingers, or by making dots
with pencil, or to work out in its head the addition of cer-
tain figures.

"5. Child adds each first column correctly, but misses
often on second and third columns.

"6. Child's time per example increases steadily or irregu-
larly; particularly after two or three minutes' work; i.e., 15
seconds each for first five examples, 17 seconds each for
the next five, 23 seconds for next two, 45 seconds for the
next example, etc.

"7. Child's habits apparently good and work steady, but
answers wrong."

The diagnosis and correctives follow:

"1. Slow movements may be due either to bad habits of
work or to slow nerve action. In the latter case, the diffi- 
culty will prove very hard to control. It is almost certain 
that no amount of training will ever alter the nerve struc- 
ture and so remedy the fundamental cause. But in all such 
cases much can be done to generate ideals of speed, to help 
the child to eliminate waste motions, and to hold himself 
up to his best rate. 

"In any case the procedure would be as follows: Ask
the child to add the first example alone so that you may 
time him. Give him the signal when to start and let him 
signal when he has finished. Let him make several trials 
of the same example to make sure that he does not improve 
under practice. The teacher should then give the child the
watch and let him time the teacher in working the same
example. Comment on difference in child's and teacher's
times. Then have the child write in small figures all the
partial sums, as shown in the illustration [a column of
figures with its partial sums written in small figures beside
it; not reproduced here]. The teacher should again time the
child, letting him read to himself the partial sums as rapidly
as he can. This will, of course, give the minimum time in
which the child could possibly add the example. The time
records of a child with true defective motor control will show
slight improvement, if any, even with such aid, and probably
the only procedure to follow in such cases is to lower the
standard to correspond. Where there is a marked difference
in time between the original and this last performance, the
child will get, for the first time in its life, perhaps, a
perfectly clear conception of what
working at standard speed really means, as well as the sen- 
sation of really working at that speed. The teacher and 
child should then practice the same example over and over 
until the child can without the crutches add it at the stand- 
ard rate. Now the teacher can give him the whole test 
again, urging him to work at his best speed and comparing 
his results with the first result. The improvement made by 
ten minutes of this kind of work enables the teacher to say 
that a proper amount of similar study would produce the 
changes desired. 

"'But,' some teacher will say, 'will the child not learn
the example by heart?' This is precisely what is desired.
A perfect adder has learned so many examples 'by heart'
that it is impossible to make up any arrangement of figures 
that will be in any way new to him. The child in the same 
way needs to perfect his control over each example until he 
finally attains to mastery over all. 

"2. If the child gives evidence of nervous strain, check 
his speed, teach him to relax and to work easily and quietly. 
Get good habits of work first, then bring up speed and 
accuracy by degrees. The nervousness of a child is usually 
caused by social conditions, physical health, or tempera- 
mental bias. In any event it is difficult to control. Look 
out for a large fatigue factor in nervous children. 




"3. Irregular speed up the column may be due to either 
of two factors: lack of control of attention, or lack of knowl- 
edge of the combinations. The latter factor will be dis- 
cussed in the following paragraph (4). Attention will be 
considered here. 

"There is a limit to the length of time that a person can
carry on any mental activity continuously. As time goes 
on, the mind tends to respond more and more readily to any 
new mental stimulus than it does to the old. The mind 
'wanders' as it is said. The attention span for many chil-
dren is six additions, for some only three or four, for others
eight, or ten, and so on. That is, a child whose attention 
span is limited to six figures may add rapidly, smoothly, 
and accurately, for the first five figures in the column, giv- 
ing its attention wholly to the work. As the limit of its 
attention span is reached, however, it becomes increasingly 
difficult for it to concentrate its attention. The child sud- 
denly becomes conscious of its own physical fatigue, of the 
sights and sounds around it. The mind balks at the next 
addition; it may be a simple combination, as adding 2 to 
the partial sum, 27, held in mind. It finally becomes impera- 
tive that the child momentarily interrupt its adding activity 
and attend to something else. If this is done for a small 
fraction of a second, the mind clears and the adding activity 
will go on smoothly for a second group of six figures, when 
the inattention must be repeated. 

"It should be evident that these periods of inattention are 
critical periods. If the sum to be held in mind is 27, there 
is great danger that it will be remembered as 17, 37, 26, or 
some other amount, as the attention returns to the work of 
adding. The child must, therefore, learn to 'bridge' its 
attention spans successfully. It must learn to recognize the 
critical period when it occurs, consciously to divert its atten- 
tion while giving its mind to remembering accurately the 
sum of the figures already added. This is probably best 
done by mechanically repeating to one's self mentally, 
'twenty-seven, twenty-seven, twenty-seven,' or whatever the 
sum may be, during the whole interval of inattention. Little 
is known about the different methods of bridging the atten- 
tion spans and it may well be that other methods would
prove more effective. The use of the device suggested above,
however, is common. 

"Giving up in the middle of a column and commencing
again at the beginning is almost a certain symptom of lack 
of control of the attention. On the other hand, mere inac- 
curacy of addition (as 27 plus 2 equals 28) may be due to 
lack of control over the combinations. If the errors occur 
at more or less regular points in a column, and if, further, 
the combinations missed vary slightly when the column is 
re-added, the difficulty is pretty sure to be one of attention 
and not one of knowledge. 

"4. Hesitation in adding the next figure, when not due 
to attention, is usually due to lack of control of the funda- 
mental combinations. In such cases, however, the hesita- 
tion or mistakes are usually repeated at the same point on 
subsequent additions. The teacher should understand that 
it 'takes time to make mistakes,' and whenever a lengthening
of the time interval occurs, it is a symptom of a difficulty 
which must be found and remedied. 

"In this case the remedy is not a study of the separate 
combinations. It has been proved⁴ that for most children
time spent in study of the tables is waste effort; that the 
abilities generated are specific and do not transfer. A child 
may know 6 plus 9 perfectly, and yet not be able to add 9 
to 26 in column addition except by counting on its fingers. 
The combinations must be learned, of course, but they
should be learned by practicing column addition. Follow
the method outlined in paragraph (1) above, having the
column added over and over again until both standard speed 
and absolute accuracy have been attained. 

"5. The sums of a child who is unable to remember the 
numbers to be carried, but whose work is otherwise perfect, 
will usually have the first column added correctly, as well as 
all single columns. Unfortunately, however, inability to 
carry correctly is usually a fault of children with weak 
memories for partial sums in the column. It is well, therefore,
to test the carrying habits of any child that is inaccurate.
Many children do not add the number carried until
the end of the next column; it should, of course, be added
to the first figure in the column. If necessary the number
to be carried should be emphasized as by saying, when the
sum of a column is 27, 'carry 2' to one's self as the 7 is
written. This is again a time-consuming device which
should be adopted only as a last resort. The carrying
should be an automatic, unconscious operation. Repeated
practice on a few examples until the same become so perfectly
familiar that a child's whole attention may be given
to establishing correct habits of carrying will prove beneficial.

4 See Bulletin No. 2, Department of Cooperative Research, Courtis Standard
Tests, 82 Eliot St., Detroit, Mich. Price 15 cents. See also Journal of Educational
Psychology, September, 1914.

"6. Marked increases in the times required for the suc- 
cessive examples of a test are an indication of a fatigue fac- 
tor in the control of the attention. Some children are 
unable to carry on continuously a single activity, as adding, 
through even a four-minute time interval without a very 
great loss in power. Two courses are open to the teacher, 
one or the other of which is sometimes effective: one is 
to determine the exact length of the interval at which the 
child can work efficiently, and then try to extend the interval 
slightly each day; the other is to set the child at work on 
very long and very hard examples, and to lengthen the 
time intervals to fifteen or twenty minutes' continuous work. 
Difficulties of this type are hard to remedy." 
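
The distinction drawn above between errors of attention and errors of knowledge, and the fatigue symptom of paragraph 6, lend themselves to a simple tabulation. The following Python sketch is offered only as an illustration and is no part of the quoted procedure. The function names, the record format (positions of the errors on each re-addition of one column, and times for successive examples), and the ratio of 1.5 used to flag fatigue are all assumptions made here.

```python
from typing import Sequence


def classify_addition_errors(error_runs: Sequence[Sequence[int]]) -> str:
    """Guess the likely cause of errors from repeated additions of one column.

    error_runs[k] lists the positions (0 = first figure) at which the child
    erred on the k-th attempt at the same column.
    """
    if len(error_runs) < 2:
        raise ValueError("need at least two attempts at the same column")
    runs = [set(run) for run in error_runs]
    if not any(runs):
        return "no errors"
    # A position missed on every attempt points to an unknown combination.
    if set.intersection(*runs):
        return "knowledge of combinations"
    # Errors that shift from attempt to attempt point to attention.
    return "attention"


def shows_fatigue(times: Sequence[float], ratio: float = 1.5) -> bool:
    """Return True when later examples take markedly longer than earlier ones.

    Compares the mean time of the second half of the test with that of the
    first half. The default ratio is an arbitrary illustration, not a norm.
    """
    if len(times) < 4:
        raise ValueError("need at least four timed examples")
    half = len(times) // 2
    early = sum(times[:half]) / half
    late = sum(times[half:]) / (len(times) - half)
    return late > ratio * early
```

Under these assumptions, errors at positions 5 and 11 on one addition and at 6 and 12 on the next would be classed as attention, while an error at position 3 on both attempts would be classed as knowledge of combinations. Such a tally only suggests where the teacher should look; it does not replace observation of the child at work.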

Diagnostic Methods: Oral Tracing of Process. —
There are difficulties the underlying causes of which would 
never come to light from an introspective inquiry on the 
part of the pupil or from mere observation of the pupil's 
normal work. The purpose of the diagnostic process is, of 
course, to induce the pupil to commit some overt act which 
will reveal the invisible causes of his visible defect. When 
neither his ordinary actions nor his written work offers a 
suggestion it is well to have the pupil go through the process 
orally. When I fail to make the class in educational meas- 
urement understand the computation of, say, the median, I 
find it advantageous to ask one of the students who is having
trouble to come to the blackboard and compute a median
orally for the class. The cause of the difficulty is thus 
quickly found. 

Uhl used the oral-tracing method to discover the mental 
processes through which pupils go in adding and subtracting. 
The old phrase: "Beat the devil around the stump" accurately
describes how some pupils work. To quote Uhl:⁵

"The findings as to methods employed by pupils in 'difficult'
combinations is both interesting and significant. The
following methods were found in the work of pupils who 
were tried out in the manner just described. A fourth-grade 
boy showed by slow work that the combination 9 — 7 — 5 
was difficult for him. When questioned, he showed that he 
used a common form of 'breaking-up' the larger digits. In 
working the problem, he said to himself: '9 + 2 + 2 + 2 +
1 = 16 and 21.' This shows that the 9 — 7 combination
was not known, but that the 16 — 5 combination was, in- 
asmuch as he arrived at '21' directly after having combined 
the other two numbers. Another boy of the same grade 
showed the same type of difficulty in a more pronounced 
form. He added 8, 6, and 0 as follows: 'First take 4, then
take 2, then add 8, and 4 makes 12, and 2 makes 14.' In
adding 9, 7, and 5 he said: '9 and 3 is 12 and 4 is 16 and
2 — 18; and 2 — 20; and 1 — 21.' He broke into parts
even so easy a problem as 3 + 4 + 9, adding 9 + 3 + 2 + 
2 = 16. 

"A pupil from the fifth grade presented a quite different 
method of adding. In adding 4, 9, and 6 she explained: 
'Take the 6, then add 3 out of the 4. Then 9 and 9 are 18, 
and 1 are 19.' Other problems were worked out similarly:
one containing 3, 9, and 8 was solved as follows: '8 and 8
are 16 and 3 are 19 and 1 are 20'; 5, 6, and 9 as follows:
'6, 7, 8, 9, and 9 are 18 and 2 are 20.' This tendency to 
build up combinations of 8's or 9's continued in the case 
of another problem: 6, 5, and 8 were added thus: '6, 7, 8, 
and 8 are 16 and 3 are 19.' Probably her first problem 
was worked similarly, but I had to have her dictate her
method twice before I understood; she then gave it as
quoted.

5 W. L. Uhl, "The Use of Standardized Materials in Arithmetic for Diagnosing
Pupils' Methods of Work"; Elementary School Journal, November, 1917.

"Methods which are quite as clumsy are found in the
case of subtraction. One boy of the fifth grade was found 
to build up his subtrahend in the case of many problems. 
For example, in subtracting 8 from 37, he increased his 
subtrahend to 10, then obtained 27, and finally added 2 to 
27 to compensate for the addition of 2 to 8. Likewise, in 
subtracting 7 from 30, he added 3 to 7 and proceeded as 
before. This boy knew certain combinations very well, 
but did problems containing other combinations by a method 
much harder than the correct one. 

"Even greater resourcefulness was shown by a fifth- 
grade boy who found the differences between some numbers 
by first dividing, then noting the remainder or lack of one, 
then multiplying, and finally adding to or taking from the 
result as necessary. For example, in subtracting 9 from 
44, he proceeded as follows: 'Nine goes into 44 five times
and 1 less; 4 times 9 are 36, minus 1 equals 35.' That is,
this boy knew certain multiplication combinations better 
than he did certain subtraction processes; therefore, he 
used multiplication, making adjustments either upward or 
downward as demanded by the problem."
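
The two roundabout subtraction methods reported by Uhl can be written out step by step, which makes it plain that each yields the correct difference by a longer road. The Python sketch below is an interpretation made here for illustration; the function names are hypothetical, and the second function assumes that the boy always took the multiple of the subtrahend nearest the minuend, which the quotation shows for only one example.

```python
def subtract_by_rounding_up(minuend: int, subtrahend: int, base: int = 10) -> int:
    """Raise the subtrahend to the next ten, subtract, then add back the
    amount by which the subtrahend was raised."""
    padding = (-subtrahend) % base               # 8 -> 2, 7 -> 3
    partial = minuend - (subtrahend + padding)   # 37 - 10 = 27
    return partial + padding                     # 27 + 2 = 29


def subtract_by_multiplication(minuend: int, subtrahend: int) -> int:
    """Find the multiple of the subtrahend nearest the minuend, drop back
    one multiple, then adjust upward or downward."""
    times = round(minuend / subtrahend)          # 44 / 9 -> 5
    offset = minuend - times * subtrahend        # 44 - 45 = -1 ("1 less")
    return (times - 1) * subtrahend + offset     # 36 - 1 = 35
```

Both functions assume a positive subtrahend. Each performs three or four operations where one known combination would do, which is the point of Uhl's observation.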

Diagnostic Methods: Analysis of Test Results. — 
There are many tests specially designed not only to meas- 
ure in the usual sense but to facilitate diagnosis. Mon- 
roe's Diagnostic Tests in Arithmetic is an illustration of such 
tests. Practically every standard test has some diagnostic 
value. 

Using his Reading Scale Alpha 2, Thorndike made an 
unusually subtle analysis of pupil results to discover the 
causes for imperfect comprehension in reading. The following
selected quotations⁶ will increase anyone's respect
for the mental process called reading and will show the prob- 
lem a teacher faces who undertakes to teach or diagnose this 
complex ability.

6 E. L. Thorndike, "Reading as Reasoning: A Study of Mistakes in Paragraph
Reading"; Journal of Educational Psychology, June, 1917.

"It will be the aim of this article to show that reading is
a very elaborate procedure, involving a weighing of each
of many elements in a sentence, their organization in the 
proper relations one to another, the selection of certain of 
their connotations and the rejection of others, and the 
cooperation of many forces to determine final response. In 
fact we shall find that the act of answering simple questions 
about a simple paragraph . . . includes all the features 
characteristic of typical reasoning. . . .

"In correct reading (1) each word produces a correct
meaning, (2) each such element of meaning is given a cor- 
rect weight in comparison with the others, and (3) the 
resulting ideas are examined and validated to make sure that 
they satisfy the meaning set or adjustment or purpose for 
whose sake the reading was done. Reading may be wrong 
or inadequate (1) because of wrong connections with the
words singly, (2) because of over-potency or under-potency 
of elements, or (3) because of failure to treat the ideas pro- 
duced by the reading as provisional, and so to inspect and 
welcome or reject them as they appear. . . . 

"In particular, the relational words, such as pronouns, 
conjunctions and prepositions, have meanings of many de- 
grees of exactitude. They also vary in different individuals 
in the amount of force they exert. A pupil may know 
exactly what 'though' means, but he may treat a sentence containing
it much as he would treat the same sentence with
'and' or 'or' or 'if' in the place of the 'though.'

"The importance of the correct weighting of each element 
is less appreciated. It is very great, a very large percentage 
of the mistakes made being due to the over-potency of cer- 
tain elements or the under-potency of others. . . . 

"To make a long story short, inspection of the mistakes 
shows that the potency of any word or word group in a 
question may be far above or far below its proper amount 
in relation to the rest of the question. The same holds for 
any word or word group in the paragraph. Understanding 
a paragraph implies keeping these respective weights in 
proper proportion from the start or varying their proportions 
until they together evoke a response which satisfies the pur- 
pose of the reading. 




"Understanding a paragraph is like solving a problem in 
mathematics. It consists in selecting the right elements of 
the situation and putting them together in the right rela- 
tions, and also with the right amount of weight or influence 
or force for each. The mind is assailed as it were by every 
word in the paragraph. It must select, repress, soften, 
emphasize, correlate and organize, all under the influence of 
the right mental set or purpose or demand. 

"Consider the complexity of the task in even a very 
simple case such as answering question 6 on paragraph D, 
in the case of children of grades 6, 7 and 8 who well under- 
stand the question itself. 

"John had two brothers who were both tall. Their names
were Will and Fred. John's sister, who was short, was 
named Mary. John liked Fred better than either of the 
others. All of these children except Will had red hair. He 
had brown hair.

"6. Who had red hair?

"The mind has to suppress a strong tendency for 'Will
had red hair' to act irrespective of the 'except' which precedes
it. It has to suppress a tendency for 'all these children . . .
had red hair' to act irrespective of the 'except Will.' It has to
suppress weaker tendencies for John, Fred, Mary, John and
Fred, Mary and Fred, Mary and Will, Mary, Fred and Will,
and every other combination that could be a 'who,' to act
irrespective of the satisfying of the requirement 'had red
hair according to the paragraph.' It has to suppress tendencies
for John and Will or brown and red to exchange
places in memory, for irrelevant ideas like nobody or
brothers or children to arise. That it has to suppress them
is shown by the failures to do so which occur. The 'Will had
red hair' in fact causes one-fifth of children in grades 6, 7 and
8 to answer wrongly,⁷ and about two-fifths of children in
grades 3, 4 and 5. Insufficient potency of 'except Will'
makes about one child in twenty in grades 6, 7 and 8 answer
wrongly with 'all the children,' 'all,' or 'Will, Fred, Mary
and John.'"

7 Some of these errors are due to essential ignorance of "except," though that
should not be common in pupils of grade 6 or higher.

After completing a thorough analysis of results from tests 
of pupils' ability to solve arithmetic problems, Monroe 
diagnosed many of the errors as due to inability to read the 
problems, inability to calculate accurately, and inability to 
reason correctly, which are in turn due to still more fundamental
causes. According to Monroe,⁸ pupils' mental
processes when reasoning incorrectly are fairly pictured by
Adams'⁹ description of how the canny Scotch pupils solved
this freak of a problem: "If 7 and 2 make 10, what will 12
and 6 make?" The description follows: 

"A look of dismay passed over the seventy-odd faces as this apparently 
meaningless question was read. Everybody knew that 7 and 2 didn't make 
10, so that was nonsense. But even if it had been sense, what was the use 
of it? For everybody knew that 12 and 6 make 18 — nobody needed the help 
of 7 and 2 to find that out. Nobody knew exactly how to treat this 
strange problem. 

"Fat John Thomson, from the foot of the class, raised his hand, and 
when asked what he wanted, said: 

"'Please, sir, what rule is it?' 

"Mr. Leckie smiled as he answered: 

"'You must find out for yourself, John; what rule do you think it is, 
now?' 

"But John had nothing to say to such foolishness. 'What's the use of 
giving a fellow a count¹⁰ and not telling him the rule?' — that's what John
thought. But as it was a heinous sin in Standard VI (seventh grade) to 
have 'nothing on your slate,' John proceeded to put down various figures 
and dots, and then went on to divide and multiply them time about. 

"He first multiplied 7 by 2 and got 14. Then, dividing by 10, he got 
1 2/5. But he didn't like the look of this. He hated fractions. Besides,
he knew from bitter experience that whenever he had fractions in his 
answer he was wrong. 

"So he multiplied 14 by 10 this time, and got 140, which certainly looked 
much better, and caused less trouble. 

"He thought that 12 ought to come out of 140; they both looked nice, 
easy, good-natured numbers. But when he found that the answer was 11 
and 8 over, he knew that he had not yet hit upon the right tack; for 
remainders are just as fatal in answers as fractions. At least, that was 
John's experience. 

"Accordingly, he rubbed out this false move into division, and fell back 
upon multiplication. When he had multiplied 140 by 12, he found the 
answer 1680, which seemed to him a fine, big, sensible sort of answer. 

"Then he began to wonder whether division was going to work this time. 
As he proceeded to divide by 6, his eyes gleamed with triumph.

"'Six into 48, 8 an' nothin' over, — 2 — 8 — 0 an' no remainder. I've
got it!'

8 Walter S. Monroe, Measuring the Results of Teaching, pp. 154-172; Houghton
Mifflin Co., N. Y., 1918.

9 John Adams, Exposition and Illustration in Teaching, pp. 176-178.

10 Scotch: Any kind of arithmetical exercise in school.

"Here poor John fell back in his seat, folded his arms, and waited 
patiently till his less fortunate fellows had finished.

"James¹¹ knew from the 'if' at the beginning of the question that it must
be proportion; and since there were five terms, it must be compound pro- 
portion. That was all plain enough, so he started, following his rule: 

"'If 7 gives 10, what will 2 give? — less.' 

"Then he put down 

7 : 2 : : 10 :

"Then 'if 12 gives 10, what will 6 give? — again less.' So he put down
this time 

12 : 6 

"Then he went on loyally to follow his rule: multiplied all the second 
and third terms together, and duly divided by the product of the first 
two terms. This gave the very unpromising answer 1 3/7.

"He did not at all see how 12 and 6 could make 1 3/7. But that wasn't
his lookout. Let the rule see to that." 
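
The figures in Adams' story can be checked mechanically, and the check shows that both boys computed faithfully while reasoning not at all. The short Python sketch below merely retraces the arithmetic as the passage reports it; the variable names and the helper as_mixed are introduced here for illustration.

```python
from fractions import Fraction

# John Thomson's chain of trials.
step1 = 7 * 2                                    # 14
step2 = Fraction(step1, 10)                      # 1 2/5, rejected as a fraction
step3 = step1 * 10                               # 140
quotient, remainder = divmod(step3, 12)          # 11 and 8 over, rejected
step4 = step3 * 12                               # 1680
john_answer, john_remainder = divmod(step4, 6)   # 280 and nothing over

# James's compound proportion: the product of the second and third terms
# divided by the product of the first terms.
james_answer = Fraction(2 * 6 * 10, 7 * 12)      # 10/7


def as_mixed(value: Fraction) -> str:
    """Format a positive fraction as a mixed number such as '1 3/7'."""
    whole, rest = divmod(value.numerator, value.denominator)
    if rest == 0:
        return str(whole)
    return f"{whole} {rest}/{value.denominator}"
```

Every step agrees with the story: 1 2/5, then 11 and 8 over, then 280 with no remainder for John, and 1 3/7 for James. The fault lies wholly in the choice of operations, which is what a diagnosis of reasoning must uncover.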

Diagnostic Methods: Developmental History. — De- 
velopmental history is as useful a method for educational 
diagnosis as for medical or mechanical or any other form 
of diagnosis. Go to a doctor with an obscure physical defect 
and he will enquire about your total past. An automobile 
repairman asks you to relate just what you did to the car 
to put it out of order. Take a mentally defective child to 
a psychologist and he will comb the child's history to see 
if something in that past may not suggest a diagnosis. The 
developmental history not only goes back to the pre-natal 
environment of the child, but to the life of the parents and 
grandparents. Many fundamental educational defects are 
not of recent origin. They have been cumulative. They 
have remained unnoticed for years. Their roots reach far 
back into the past. A successful diagnosis requires that 
these roots be traced back to their origins. 

Diagnostic Methods: Contrast of Opposites. — Frequently
a teacher does not succeed at a diagnosis simply
because she does not know what are the customary causes 
of defects in the ability in question. Suppose, for example, 
that a pupil is not making satisfactory progress because 
his method of work is inefficient. A teacher who does not 
know what methods are and are not efficient is not likely 
to succeed with this diagnosis.

11 The clever boy of the class.

A diagnostic method which will help inexperienced 
teachers is to contrast opposites. The contrast may be 
between the best and poorest of the class, or of pupils in one
grade with pupils in a lower or higher grade. This method
is to observe the two or three most successful pupils at their 
work and immediately after to observe the two or three 
most unsuccessful pupils, or to have both groups trace the 
process orally, or to test both groups and analyze the re- 
sults, or to use any other of the diagnostic methods upon 
both groups at the same time. Diagnosis by contrasting 
opposites will throw in relief the differences between com- 
petent and incompetent pupils and will thus facilitate 
diagnosis. 

Diagnostic Methods: Complete Analysis of Ability. 
— A complete and thorough analysis of the sensory, mental, 
and motor processes involved in a given ability is the last 
resort of the diagnostician. It is the last resort because it 
is time consuming, and because if it fails the diagnostician 
can do nothing further. A complete analysis usually re- 
quires the combined use of all the previously described 
diagnostic methods. It utilizes data gleaned from the 
child's introspections, from observations of his normal work, 
from the child's oral tracing of the process, from analyses 
of test results, and from a developmental history. 

The technique of diagnosis has been illustrated for arith- 
metic and reading. The last illustration is for spelling. An 
excellent illustration for a complete analysis of a school 
ability is that of Hollingworth and Winford.¹²

"THE PSYCHOLOGICAL EXAMINATION OF POOR SPELLERS

By Leta S. Hollingworth, Assistant Professor of Education, Teachers College

"It is virtually impossible for an educated adult, whose 
spelling habits have long ago become automatic, to reconstruct
from introspection the long, difficult, and complex
processes through which he passed in learning to communicate
by means of correctly spelled words. Such an adult
may gain some idea of what is involved in the spelling
process by confronting himself with the task of learning to
spell and write words upside down and backwards, but even
so the experience of the child is far from duplicated.

12 Leta S. Hollingworth and C. A. Winford, "The Psychology of Special Disability
in Spelling"; Teachers College Record, March, 1919.

"In casting about for material from which to elaborate 
the analysis of spelling, upon which the psychological exam- 
ination of poor spellers must be based, it is found that two 
main sources of information are available. In the first 
place, we may make controlled observations of children of 
various ages, who are actually engaged in forming the bonds 
involved in spelling. In the second place, we may observe 
experimentally the behavior of those neurological cases, 
which are characterized by selective loss or enfeeblement of 
bonds once well established. 

"Such observations teach us that the aspect of linguistic 
attainment, which we call spelling, is by no means a simple
process, consisting merely in the functioning of a single bond 
or kind of bonds between a given stimulus and a given re- 
sponse. The process of learning to communicate by means 
of correctly spelled words ordinarily involves the formation 
of a series of bonds approximately as follows:

"(1) An object, act, quality, relation, etc., is 'bound' to a certain
sound, which has often been repeated while the object is pointed at, 
the act performed, etc. In order that the bond may become definitely 
established, it is necessary (a) that the individual should be able to 
identify in consciousness the object, act, quality, etc., and (b) that he 
should be able to recollect the particular vocal sounds which have been 
associated therewith. 

"(2) The sound (word) becomes 'bound' with performance of the 
very complex muscular act necessary for articulating it. 

"(3) Certain printed or written symbols, arbitrarily chosen, visually 
representing sound combinations, become 'bound' (a) with the
recognized objects, acts, etc., and (b) with their vocal representatives, so
that when these symbols are presented to sight, the word can be 
uttered by the perceiving individual. This is what we should call 
ability 'to read' the word. 

"(4) The separate symbols (letters) become associated with each 
other in the proper sequence, and have the effect of calling each other 
up to consciousness in the prescribed order. When this has taken place 
we say that the individual can spell orally.

"(5) The child by a slow, voluntary process 'binds' the visual perception
of the separate letters with the muscular movements of arm,
hand, and fingers necessary to copy the word.

"(6) The child 'binds' the representatives in consciousness of the 
visual symbols with the motor responses necessary to produce the 
written word spontaneously, at pleasure. 

"This analysis may not be exhaustive, but it provides a 
foundation on which to construct a scheme for the psycho- 
logical examination of poor spellers. Obviously, poor spell- 
ing may be due to one or another of quite different defects, 
or to a combination of several defects. In an ability so com- 
plex as this there is opportunity for the occurrence of a 
great variety of deficiencies. In any particular case the 
underlying cause can be discovered only by means of a 
psychological examination covering the various mental 
processes involved. The following outline is based on ex- 
perimental teaching done at Teachers College during the 
academic year 1916-1917.¹³

"1. Poor spelling may be due to sensory defects, either
of the ear or of the eye. If sounds are indistinct, or visual 
stimuli are vague or distorted, the bonds involving these 
sensations will be difficult to form. Thus tests of auditory 
and visual acuity must be given. If any sensory defect is 
revealed, it should be corrected, if it is corrigible. The 
necessary tests are described by authors of clinical manuals. 

"2. The quality of general intelligence must be deter- 
mined. Failure to spell may be simply one manifestation 
of general intellectual weakness. For this purpose one of 
the general intelligence scales is to be used. The Stanford 
Revision of the Binet-Simon Scale is used by the present 
writer. 

"When we have excluded sensory defects and general in- 
tellectual deficiency from the picture, there remains the fol- 
lowing possible causes of difficulty: 

"3. The bonds which are described in our analysis under 
(2) may be inadequately or incorrectly developed. This 
would be faulty pronunciation. This is undoubtedly a very
prolific cause of poor spelling. Such errors as 'a-f-t-e-r-w-o-o-d-s'
for 'afterwards,' 'w-h-e-n-t' for 'went,' 'p-r-e-h-a-p-s'
for 'perhaps,' and 'f-a-r-t-h-e-r' for 'father,' will
serve to illustrate this point. In our observations on poor
spellers we found such errors by the score, and discovered
that the words were pronounced as spelled. Thus the poor
speller should be tested for the pronunciation of the words
which he misspells. It may be that drill in correct pronunciation
is what is needed in order to improve his spelling.

13 L. S. Hollingworth and C. A. Winford, "The Psychology of Special Disability
in Spelling"; Teachers College Contributions to Education, No. 88, 1918.

"Faulty pronunciation may itself be due to various causes.
In the majority of cases it doubtless arises from false
auditory perception, as in such misspellings as 'hares breath'
for 'hair's breadth,' or 'Mail Brothers' for 'Mayo Brothers.'
In other cases it arises from inability to articulate properly, 
as with children who stammer or lisp, or have nasal obstruc- 
tions. 

"4. It may be that the weakness lies in the formation 
of bonds, which we have noted in our analysis under (3). 
The formation of these bonds involves visual perception, 
which we found to be of first-rate importance in spelling. It 
has been known for some time that in reading, perceptual 
factors play a chief role. We discovered that among poor 
spellers error is not distributed at random, but follows cer- 
tain laws. For instance, there is a constant tendency to 
shorten words slightly in misspelling them; the influence of 
any letter over error varies greatly with the position of the 
letter in the word; the first half of a word has a very great 
advantage over the last half. From these and other facts 
it is apparent that weaknesses in visual perception con- 
tribute to the failures of many poor spellers. In order to 
determine whether such is the case with any particular child, 
it will be necessary to make an analysis of his work, to see 
whether his errors reveal perceptual weaknesses. If a child 
can spell the first halves of words correctly, but does not 
spell the last halves correctly, or if he learns to spell the 
tops of words correctly, but cannot spell the bottoms of 
them, the remedy is to bring about readjustments of atten- 
tion, whereby he will look at those portions of words, which 
formerly he failed, unconsciously, to see. 
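
The analysis of a child's work recommended in this paragraph, to see whether his errors fall chiefly in the last halves of words, may be sketched as a tally. The Python below is an illustration only and is not the authors' procedure: it assumes that each error is located by the first letter at which the child's spelling departs from the correct one, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def first_departure(correct: str, attempt: str) -> Optional[int]:
    """Return the index at which the attempt first departs from the correct
    spelling, or None when the two agree throughout."""
    for index, (right, written) in enumerate(zip(correct, attempt)):
        if right != written:
            return index
    if len(correct) != len(attempt):
        # One spelling is a prefix of the other; the departure is at its end.
        return min(len(correct), len(attempt))
    return None


def tally_by_half(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count misspellings whose first departure lies in each half of the word.

    Each pair is (correct_spelling, child_spelling).
    """
    counts = {"first half": 0, "last half": 0}
    for correct, attempt in pairs:
        position = first_departure(correct, attempt)
        if position is None:
            continue  # spelled correctly, nothing to tally
        half = "first half" if position < len(correct) / 2 else "last half"
        counts[half] += 1
    return counts
```

A child whose tally runs heavily to the last half would, on the reasoning of the paragraph above, be a candidate for the readjustment of attention there described. Locating an error by its first departure is a crude rule, since a single misspelling may contain several faults.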

"5. Poor spelling may be due to sheer failure to remem- 
ber — failure to retain impressions which were originally 
clearly and correctly perceived. This may mean simply that 
the child requires an unusually large number of repetitions 
before he can form the bonds described under (4) in our
analysis; or it may be that his memory span is abnormally
short and that he cannot easily associate more than three or 
four elements together as a unitary sequence. Tests of 
memory span for various kinds of materials should be insti- 
tuted in order to gain light on this point. If it appears that 
his performance is decidedly below the normal for his age, 
especially when the material is letters, it may be concluded 
that too brief memory span is probably playing a part in his 
difficulties. This could be checked up further by an analysis 
of his spelling, to see to what extent he spells short words 
correctly, but misspells longer words. In cases where the 
memory span is brief, emphasis upon syllabication, prefixes, 
suffixes, and other short units should be helpful. The child 
might be able to remember three syllables of three letters 
each, but might be totally unable to retain one word of nine 
letters. Psychologically these two tasks are very different 
indeed. 
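
The check proposed in this paragraph, to see to what extent the child spells short words correctly but misspells longer ones, amounts to comparing two rates. The following Python sketch assumes a record of (word, spelled correctly) pairs and a dividing length of four letters; both the record format and the dividing length are assumptions made here, the latter suggested only by the phrase "three or four elements" in the text.

```python
from typing import Iterable, Optional, Tuple


def misspelling_rates(
    results: Iterable[Tuple[str, bool]], span: int = 4
) -> Tuple[Optional[float], Optional[float]]:
    """Return the misspelling rate for short words and for long words.

    A word counts as short when it has no more than `span` letters. A rate
    is None when the child attempted no words of that kind.
    """
    short_total = short_wrong = long_total = long_wrong = 0
    for word, correct in results:
        if len(word) <= span:
            short_total += 1
            short_wrong += not correct
        else:
            long_total += 1
            long_wrong += not correct
    short_rate = short_wrong / short_total if short_total else None
    long_rate = long_wrong / long_total if long_total else None
    return short_rate, long_rate
```

A long-word rate much above the short-word rate would agree with the hypothesis of a brief memory span, though it could not by itself establish it; the memory span tests mentioned in the paragraph remain the direct evidence.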

"6. Smedley suggested years ago that there might be a
'rational element' in spelling, whereby knowledge of the 
meaning of words would contribute to the correct spelling 
of them, in and of itself. Bonds involving meaning are con- 
sidered in our analysis under (i). In our experimental 
work we found that children produce many more misspell- 
ings in writing words of the meaning of which they are 
ignorant or uncertain, than they produce in writing words 
the meaning of which they know. Hence it is of interest 
to test the child for knowledge of the meaning of words 
which he misspells. It is necessary to find out whether the 
words which trouble him are in his vocabulary. It may be 
that the misspellings which he produces are without content 
to him. Surely it is conceivable that the absence of a con- 
cept might detract from success in arranging the garment 
in which it should be clad. 

"7. Motor awkwardness and incoordination may con- 
tribute to poor spelling. Here are involved the bonds dis- 
cussed by us under (5) and (6). In written spelling (with 
which education is chiefly concerned, since there is but little 
use for oral spelling in practical life), it is necessary not
only to know what symbols are required, but to execute them 
successfully, with arm, hand, and fingers. Here we must
have recourse to motor tests for steadiness, coordination,
and speed of voluntary movement. Occasionally one finds 
a child who does much better at oral spelling than he does 
at written spelling. In such cases improvement in hand- 
writing is what is needed, either in rate or in quality. A slow 
writer may misspell many words if he attempts to hurry. 

"8. In the course of our observation we perceived that
many of the mistakes of poor spellers are simply lapses. 
These are errors committed by children who 'know better,' 
who can correct the mistake spontaneously as soon as atten- 
tion is called to it. There are wide individual differences in 
the liability to lapse. It is difficult to see what remedial 
measures may be taken to improve those whose disability is 
due largely to lapsing, since lapses are not only involuntary, 
but for the most part unconscious; there is no awareness 
of them until their primary memory has been lost. 

"One might suggest tentatively that children who show
this tendency in marked degree should be trained to lay
aside for a few minutes all written communications; then to
take up their work and look carefully at each word in order 
to correct all lapses. It is not known experimentally how 
long an interval must elapse in order that writing may 'get 
cold,' so that lapses may be detected by the author of them. 
A few minutes will probably suffice. 

"9. Transfer of habits previously acquired is occasion- 
ally the cause of misspelling. We found, for example, one 
poor speller, who had previously learned in a phonetic lan- 
guage. He carried over this habit into English spelling, and 
it was very difficult for him to adjust himself. The possible 
existence of such an influence is to be determined by taking 
the child's school history. 

"10. Sometimes it happens that the errors of a child are 
largely of one particular kind. Such idiosyncrasies may be 
exemplified by the case of a child who had a strong tendency 
to add final 'e' to all words; and by the case of another who 
was addicted to intrusive consonants, especially 'm' and 'n.' 
These idiosyncrasies may doubtless be traced to their source 
in every case by a patient analysis of the mental contents of 
the child. The cause of error will be different in every case. 
It is impossible to generalize about idiosyncrasies. 




"11. After all of the foregoing factors have been considered,
there still remains the possibility that the failure
to learn is due wholly or partially to temperamental traits 
— indifference, carelessness, lack of motivation, distaste for 
intellectual drudgery. English spelling calls largely for rote 
learning. It can be acquired only by the formation of thou- 
sands of specific bonds, arbitrarily prescribed. Its pursuit 
is extremely tedious at best. Thus many children will be 
temperamentally ill adapted to become good spellers. 

"Disability in spelling may result from any one of the
defects which we have outlined, or from any combination 
of two or more of them. It is apparent that the psycho- 
logical examination of a poor speller is neither a brief nor 
a simple task. The direct examination of the individual 
should furthermore be supplemented by a family history, a 
developmental history, and a school history. In some cases 
special defect in spelling appears to be hereditary. Stephen- 
son,^* for instance, has reported six cases of inability to read 
and spell, which occurred in three generations of one family. 
In some of our own cases relatives have been affected with 
linguistic disabilities. 

"A developmental history will reveal whether the child
was backward in speech, whether he has or has had any 
speech defects, and whether he has been affected by any 
illness that might conceivably have produced localized 
lesions in the central nervous system, or have affected linguistic
ability in any other way. One of our cases, for
example, had suffered a paralysis of the soft palate following 
diphtheria, which had for some time interfered with articu- 
lation. Another had a history of having been tongue-tied 
till he was eight years old. Such facts may be of consider- 
able interest in a given case. 

"A school history is essential in order to determine
whether progress in school subjects other than spelling has 
been normal, whether learning in a language other than 
English has taken place, and whether the disability has 
operated to cause general retardation in school status. 

"The question naturally arises as to whether the difficulty
is always remediable when located. It is quite possible that 

^* S. Stephenson, "Six Cases of Congenital Word-Blindness Affecting Three
Generations of One Family"; Ophthalmoscope, August, 1907.




there exist cases where the necessary bonds cannot all be 
formed, even with the maximum of practice and effort. 
Experimental teaching has not yet been undertaken to an 
extent which would give the answer to the question. In 
those rare cases where the disability is very extreme, in a 
child of good general capacity, it is probably wise to make 
some special provision for oral recitation and examination 
and thus to allow the child to make progress in school,
rather than to keep him back year after year on account of 
his disability." 

Prerequisites of Skill in Diagnosis. — Success as a 
diagnostician requires: (1) A knowledge of the usual
causes of usual defects in the various abilities developed 
by the school. (2) Eyes to see and training or experience 
to interpret subtle behavior as evidence of the operation of 
known causes. (3) A technique which will bring otherwise 
invisible hints to the surface. (4) A knowledge of what 
remedial measures to prescribe for a given diagnosis. 

Summarized below are certain basic causes which are re- 
sponsible for many defects and whose operation is not 
confined to any one subject. Just as pestilences can usually 
be traced back to a few sources, so many diagnostic traits, 
irrespective of the abilities from which they start, lead 
back to a few basic causes, especially when the defect 
being diagnosed is an annoyingly persistent one. Before 
anyone attempts diagnosis he should have a knowledge of
the more common fundamental breeders of ability defects. 

Insufficient practice. — In some pupils a given ability does 
not function at all, simply because they have never studied 
to develop the ability, or it functions imperfectly because 
they have not had enough study and practice. This condi- 
tion need cause no special concern for it is easily remedied. 
The time for real concern comes when a normal amount of 
study and practice fails to eliminate the absence or imper- 
fection of functioning. 

Improper methods of work. — There may be an optimum 
method of work. Pupils who differ in type or temperament 




may require different methods or again there may be an 
optimum method for all pupils. At any rate many pupils 
are working below par because they are employing ineffec- 
tive methods. 

A special case of improper methods of work occurs in 
those abilities where speed and quality are intimately related.
It may be that a pupil's ability is functioning
imperfectly because it is functioning either too speedily or 
not speedily enough. 

Deficiency in fundamental skills. — Deficiency may mean 
either absence of sufficient skill or absence of sufficient 
transfer of skill to the new situation or both. The mental 
processes cannot flower into appreciation of literature, nor
is the mind free to reflect upon the great principles of his- 
tory, geography, science, mathematical problems and other 
higher stages in education until the underlying skills are 
both made automatic and transferable. The youth who does 
not come from a cultured home and whose learning has 
been hastily grafted on an ignorant home training, is barely 
conscious of his own ideas when addressing a cultured 
audience and scarcely enjoys what he eats when dining with 
a cultured family. All his attention is concentrated upon 
watching lest he ''gabble like a goose," or upon observing 
lest he use the wrong spoon. The pupil who stumbles in 
his reading halts in his history. The remedy is to make the 
basic skills automatic. 

Absence of interest. — The importance of interest or pur- 
pose in developing ability cannot easily be over-emphasized. 
There are more failures due to failure of interest than this 
world dreams of. 

Physical defects. — The diagnosis of any ability should 
carefully consider physical factors. In the case of many 
pupils food for their minds will not facilitate their school 
progress nearly so much as food for their stomachs. No 
diagnosis should omit a careful examination of sense organs, 
particularly the eyes and ears. Just as "rivers of mercy 
do not flow into the world through rye-straws," so we do 




not have an educational flood when knowledge and experi- 
ence must trickle through choked sense organs. Instruction 
cannot possibly be more than 50 per cent efficient when the 
child hears only 50 per cent of what is said to him and 
sees only 50 per cent of what he looks at. 

Again, diagnosis should consider the condition of the 
pupil's response mechanism. What goes in through the 
sense organs must come out through the response organs 
before the educative cycle is complete. More improvement 
in molding, drawing, painting, writing, manual arts and 
sports might conceivably be secured through correction of 
defects of muscular coordination than through direct in- 
struction in the abilities in question. 

"Thy body at its best 
How far can that project thy soul on its lone way." 

Subnormal intelligence. — Low native intelligence is the
preeminent cause of ability defects. Intelligence is the very 
tap-root of Igdrasil. Just as injury to the tiny pituitary 
body causes stunted stature, marked adiposity, imperfect 
sexual development and other profound changes, so a de- 
fective intelligence casts its blight upon many or all abilities. 
Because of its ubiquity and its probable unimprovability, 
this cause of defects has special significance. Its importance 
is not always understood by the superficial diagnostician, 
because the superficial diagnostician does not carry the 
process of diagnosis far enough. Unsatisfactory work in 
history may be traced to imperfect reading ability. But 
why is the reading ability imperfect? In many instances 
it will be found that reading ability is imperfect because 
of low native intelligence. Whenever retardation is general, 
and whenever there is relative unimprovability, it is well to 
test for intelligence. 



CHAPTER IV 
MEASUREMENT IN TEACHING 

I. The Use of Practice Tests 

Practice Tests Described.— After the initial tests 
are given the teacher should teach reading according to the 
best pedagogical procedure. This does not mean that 
measurement should be eliminated at this stage, for teaching 
and testing should always be continuously intermingled. In 
the case of the fundamental skills it is advisable to teach 
almost entirely by testing. Practice tests in arithmetic, 
handwriting, etc., make this possible. At the time of writ- 
ing, practice tests have not been constructed in the field of 
reading. Until such tests are constructed it would be advan- 
tageous to have pupils read material cleverly selected so 
pupils will manifest in some overt fashion the extent to 
which they comprehend the material being read. Effective 
teaching requires this constant measurement of compre- 
hension. 

A knowledge of the fundamental psychology of learning 
is necessary but not always sufficient for effective teaching. 
It is frequently difficult to translate principle into practice. 
A detailed translation would require several volumes. Only 
enough space can be spared to describe three important con- 
tributions of educational measurement toward bridging this 
gap between principles and practice, namely, practice tests, 
informal tests, and standardized scales. 

Though practice tests are in use in both elementary and 
high schools ^ it will suffice to describe a typical elementary- 
school practice test — Courtis Standard Practice Tests in 

^ I. M. Allen, "Experiments in Supervised Study," School Review, June, 1919.


Arithmetic.^ A set of these practice tests consists of 48 
stiff cards which make 48 lessons. Each lesson, except les- 
sons 13, 30, 31 and 44 which are test cards, and lessons 45, 
46, 47 and 48 which are study cards, contains just one type 
of example. The lessons begin with simple examples and 
gradually become more complex, each additional lesson rep- 
resenting just one additional difficulty. When the pupil has 
mastered the forty lessons, he has mastered all the difficul- 
ties in the addition, subtraction, multiplication, and division 
of whole numbers. There is one set of practice lessons for 
each pupil. 

Along with the practice lessons comes a Student's Practice 
Pad for each pupil. The practice pad contains sheets of 
tissue paper. The pupil inserts a lesson card into the pad 
and under a sheet of tissue paper. This permits the pupil 
to see the example and at the same time do all work on the 
tissue paper, thus enabling the lesson card to be used from 
year to year. The student's practice pad also contains 
sheets upon which a pupil can keep a daily tabular and 
graphic record of achievement and progress. 

Along with both practice lessons and practice pad comes 
a Teacher's Manual, which gives detailed instructions for 
the proper use of practice lessons and practice pads and 
warnings against their improper use by over-zealous teach- 
ers. The manual also gives much helpful advice about 
how to diagnose and remedy pupil defects in the four funda- 
mental processes. The manual also contains record sheets 
which enable the teacher to keep a continuous record of 
each pupil's work. 

The essential steps in the procedure of using these prac- 
tice tests follow: 

1. All pupils are given test card 13 which contains all
the difficulties found in lessons 1 to 13. Each pupil slips
the test card, examples up, under the topmost sheet of tissue 
paper in his practice pad. At the signal all begin work 
and continue until the signal is given to stop. 

^ Sold by World Book Company, Yonkers, N. Y.




2. Pupils exchange papers and score each other as the 
teacher calls the correct answers. 

3. All pupils who make satisfactory scores are excused 
from lessons 1 to 13. Sometimes the test is given twice to
make results reliable. Sometimes the excused pupils may 
do something else until the backward pupils catch up or 
they may take the next test and the next until a point is 
reached where they need to study. 

4. All pupils not excused from drill take lesson 1. If
they make a satisfactory score on lesson 1, the next day they
take lesson 2, and so on.

5. Those who fail on lesson 1 continue studying it and
taking it until a satisfactory score is made. 

6. As soon as a pupil finishes lesson 12 he takes test 13 
again as final proof of his mastery of the preceding lessons. 
He may work on something else until the others catch up 
or he may proceed. 

7. As soon as about 90 per cent of the class, including 
those who originally passed, have finished test 13, they take 
test 30. Those who pass test 30 are excused, and those 
who do not, drill upon lessons 14 to 30 as described above. 

8. The teacher keeps a daily record of what each pupil 
achieves, watches to see that there is no cheating, makes 
diagnoses and applies remedies where they are needed and 
only where they are needed, stimulates good work on the
part of all, sees that pupils keep their own records in good 
condition, and occasionally rescores the pupils' papers in 
order to keep their standard of scoring high. 

All the regular lesson cards have answers on the back, 
hence pupils may score themselves or each other by simply 
turning the lesson card over and reinserting it under the 
tissue paper. The teacher's attention is thus freed for the 
real work of individual instruction, since no papers are 
handed in to her except those which the pupil himself judges 
to be perfect. 
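
The routine in steps 1 to 8 can be sketched as a short program. What follows is only an illustration under stated assumptions: the names `work_through_block`, `passes_test` and `passes_lesson` are hypothetical, and what counts as a "satisfactory score" is left to whoever supplies the two callbacks. The block of lessons 1 to 12 guarded by test card 13 is taken from the steps above.

```python
def work_through_block(first_lesson, test_card, passes_test, passes_lesson):
    """Return the ordered list of cards one pupil takes for one block.

    Lessons first_lesson .. test_card - 1 are drill lessons; test_card is the
    inventory test that guards them (for example lessons 1-12 and test 13).
    passes_test(card) and passes_lesson(lesson, attempt) are hypothetical
    callbacks standing in for "makes a satisfactory score".
    """
    log = [("test", test_card)]
    if passes_test(test_card):
        return log                                 # step 3: excused from the block
    for lesson in range(first_lesson, test_card):
        attempt = 1
        log.append(("lesson", lesson))
        while not passes_lesson(lesson, attempt):  # step 5: repeat until satisfactory
            attempt += 1
            log.append(("lesson", lesson))
    log.append(("test", test_card))                # step 6: retake the test as final proof
    return log


# A pupil who fails the inventory test and needs two tries on lesson 3 only.
cards = work_through_block(
    first_lesson=1,
    test_card=13,
    passes_test=lambda card: False,
    passes_lesson=lambda lesson, attempt: lesson != 3 or attempt >= 2,
)
```

The same function would serve for the next block by calling it with `first_lesson=14` and `test_card=30`; the teacher's record keeping in step 8 is deliberately left out of the sketch.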

Practice Tests Individualize Instruction. — Mass instruction
is highly inefficient, and this is particularly the




case with skills. The interests of study, instruction and 
supervision are identical. All focus upon study. Study is 
highly individual. Instruction must be equally individual 
if it is to be efficient. Mass instruction aims at everybody. 
It frequently hits nobody. 

The amoeba has three types of reactions produced by 
three types of stimuli. There are, first, positive stimuli in the
form of satisfying food and the like. The amoeba reacts 
by advancing toward these stimuli. The teacher uses posi- 
tive stimuli to attract pupils toward good habits of work. 
There are, second, negative stimuli to which the amoeba
reacts by retreating. The teacher uses negative stimuli to
drive the pupil out of bad habits of work. There are, third,
neutral stimuli which produce neutral reactions in the 
amoeba, for neutral stimuli do not stimulate at all and 
neutral reactions simply mean no reactions at all. It is 
the teacher's ambition to become so efficient that every 
word she speaks or move she makes will be a positive or 
negative stimulus depending upon her choice. But in mass 
instruction most of the stimuli are neutral stimuli. Our professor
of literature was right when he said that teaching
the class was "like trying to pour water from a gallon bucket
into small-necked bottles." Most of his stimuli were neu- 
tral, partly because of lack of capacity on our part, partly 
because he was employing stimuli which were neutral to 
most, negative to some, and positive to only a few. Indi- 
vidual differences are so great that wherever possible mass 
instruction should give way to individual instruction. Prac- 
tice tests are a device for individualizing instruction. With- 
out the aid of some such device individual instruction is 
impracticable. 

Practice tests automatically adapt the work to the ability 
of each pupil and thus enable each pupil to begin at that 
point which means neither reteaching nor premature teach- 
ing. This is accomplished by means of the initial inventory 
tests. Test 13 serves this function in the case of the Courtis 
Practice Tests. 




Practice tests permit each pupil to work according to his 
own methods and help him to find his best method. It is 
surprising how varied are the methods by which pupils 
learn such narrow functions as addition, subtraction, multi- 
plication, and division. Kirby ^ has shown not only that 
what is the best method for one pupil is not always the best 
method for another, but also that pupils frequently do not 
discover their best method and best rate of work until they 
are under the pressure of raising their score. 

Finally, practice tests permit each pupil to advance at 
his own rate. Every study of the varied rates of progress 
for pupils in the same class has revealed the need of some 
teaching method which makes provision for individual differ- 
ences in this respect. 

Practice Tests Strengthen the Purpose to Improve. 
— Practice tests motivate the learning process by making 
visible both distant and immediate goals and by providing a 
method whereby a pupil can measure his rate of progress 
toward these goals. Every pupil keeps a record of each 
day's achievement and draws a graph showing his progress. 
These provisions motivate through their appeal to basic 
instincts. The instinct of rivalry is so strong that work is 
turned into play by the simple process of introducing into 
it this element of rivalry. Practice tests not only make 
possible a rivalry between individuals, which is probably 
the world's most ubiquitous form of motivation, but they 
also make possible higher types of rivalry, namely, rivalry 
with one's own past record, and the rivalry of one group 
with another. 

This provision of practice tests for the keeping of scores 
is prerequisite both to intense effort and real happiness in 
school work. The games at which both children and adults 
work hardest and are happiest are invariably games where 
a score is kept. Generally speaking conventional education 

^ Thomas J. Kirby, Practice in the Case of School Children; Teachers College, 
Columbia University, New York, 1913.




does not keep scores. A sort of score is occasionally re- 
ported, but these scores are purely relative. They do not 
show how much each individual has surpassed his previous 
record. They show which pupil is relatively best and so on 
to poorest. What stimulus is that to pupils who know they 
cannot hope to outstrip a more capable competitor? And 
what stimulus is it to the victor who knows that victory 
comes without much exertion due to his native superiority? 

Practice tests motivate learning by throwing responsi- 
bility for promotion, or the attainment of the goal, upon 
the pupil. Every idle minute puts off the day when the 
goal will be reached and every industrious moment hastens 
the coming of the day, and what is important, the pupil is 
made to clearly perceive this intimate relation and is forced 
to recognize the fairness and justice of it. Just as certain 
as a pupil idles he will be punished and just as sure as he 
works he will be rewarded. 

Practice Tests Secure a Maximum of Exercise. — 
The second fundamental law of learning is, according to 
Thorndike, the law of exercise. When purpose is strong or 
when the law of effect is appropriately utilized and when
exercise is abundant we have the optimum conditions for 
rapid progress. Here is the way I once taught addition to 
a class of forty pupils. 

"Will each pupil copy on a sheet of paper the addition 
examples which I shall read to you, five examples to the 
row?" And then, 

"Mary, you give orally the answers to the examples in
the first row." And then, 

"John, you take the second row," etc. 

Each patiently or mischievously, according to his nature, 
waited until his turn came to begin. Only one pupil's 
neurons were exercising at a time, because I told each 
one just exactly where the preceding one stopped. Subse- 
quent observations of other teachers have shown that my 
stupidity was not an isolated case. This one-out-of-forty




sort of exercise is quite common. Had I used modern prac- 
tice tests, probably without knowing it, I would have multi- 
plied my efficiency just forty times. 

Practice Tests Facilitate Aid and Diagnosis. — Prac- 
tice tests bring swift aid to the pupil who needs it, and 
prevent teaching when it is not needed. Effort expended 
which brings no return in terms of progress brings discour- 
agement. When discouragement reaches a certain stage 
effort ceases. Under ordinary conditions pupils sometimes 
remain for years undiscovered in the Slough of Despond. 
When the pupil's curve of progress ceases to rise to reward 
his effort, a teacher is needed. For the teacher to help at 
any other time would probably be to waste her time and 
injure the pupil. When to teach is instantly revealed by 
the curve of progress graphed by the pupil. 

Practice tests facilitate diagnosis. Successful diagnosis 
requires the teacher to discover the exact location of the 
difficulty and the exact cause of the difficulty. Like tracer 
bullets, the pupil's daily scores leave behind a fiery trail
which instantly reveals the location of the difficulty. The 
very following of this trail helps to eliminate probable ex- 
planations and thus facilitates diagnosis. 

The chief danger from practice tests is not that they will 
cause too much emphasis upon drill, because the accompany- 
ing manuals allot a conservative time and constantly urge 
teachers not to exceed this time. The chief danger is that 
teachers will consider practice tests as something apart, so 
that the abilities developed by them will not function in 
life situations. The use of practice tests should grow out 
of genuine situations and should be continually associated 
with genuine situations. There comes a time in the execution 
of projects where the pupil realizes that his skill is inade- 
quate. It is the function of practice tests to repair this 
inadequacy in the most economical and interesting way. 




II. The Use of Informal Examinations * 

Importance of Examinations. — There are in the
United States about 700,000 elementary school, high school, 
and college teachers. It is a conservative estimate that each 
teacher gives on the average twenty examinations a year. 
This makes 14,000,000 examinations each year. The time 
required to construct, give, and score each examination will 
average, say, three hours. This means that annually about 
42,000,000 hours are spent examining pupils. Even though 
our estimate is doubly generous, the hours would still be 
sufficient to show the enormous importance of examinations. 
Without a doubt, examinations are and will be for some 
time and may possibly always remain the most important 
form of educational measurement. Since this is so, it may 
seem that those of us who are interested in educational meas- 
urement have, in our enthusiasm for constructing and 
standardizing tests, neglected the traditional type of educa- 
tional measurement. Really, however, this has not been 
neglect on our part, for standard tests are nothing but 
improved examinations. Furthermore, we have been learn- 
ing new techniques which will in time react to improve the 
making of examinations. The purpose of the next few pages 
is to show teachers how they may make use of one of these 
new techniques of scientific testing not only to improve 
certain kinds of examinations, but also to make examina- 
tions a real pleasure instead of an onerous task to both 
teacher and pupils. 
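
The estimate above is a plain chain of multiplications, and it can be checked in a few lines. The variable names are chosen here for illustration only; the three figures are the ones given in the text.

```python
teachers = 700_000        # elementary school, high school and college teachers
exams_per_teacher = 20    # estimated examinations per teacher per year
hours_per_exam = 3        # estimated hours to construct, give and score one

exams_per_year = teachers * exams_per_teacher
hours_per_year = exams_per_year * hours_per_exam

# If the estimate were "doubly generous", half as many hours would remain.
hours_if_halved = hours_per_year // 2
```

Even the halved figure comes to twenty-one million hours a year.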

Sample True-False Examination.— The scattered ex- 
amination shown below is designed to test a pupil's knowl- 
edge of certain facts concerning the physical features of the 
United States. In actual practice a teacher will usually test 
on a much narrower topic. We have purposely written this 
examination hastily in order that it might illustrate certain 
crudities of construction. Any teacher in the elementary 

* A modified form of this section first appeared thus : Wm. A. McCall, "A New 
Kind of School Examination"; Journal of Educational Research, January, 1920. 




school could do as well and most teachers could do better. 
The same technique is equally useful to high school and col- 
lege teachers. 

The examination as presented here assumes that the state- 
ments whose truth and falsity are to be determined by the 
pupils have been mimeographed so that a copy of the 
examination could be placed in the hands of each pupil. The 
sample examination given below has been worked through 
by a pupil and been scored by a pupil or the teacher. The 
underlining was done by a pupil. The check, cross and zero 
mean respectively that the pupil's answer is correct, incor- 
rect or omitted. Only enough of the examination is shown 
below to illustrate the procedure. 

SAMPLE EXAMINATION ON UNITED STATES 

Some of the following twenty statements are true and some are false. 
When the statement is true draw a line under True; when it is false draw 
a line under False. Be sure to make a mark for every statement. If you 
do not know, guess. 

 1. In general the mountain ranges run east and west.         True  False  ✓
 2. Most of the rivers flow north.                            True  False  ✓
 3. Mt. Mitchell is the highest point east of the
    Mississippi River.                                        True  False  X
 4. Mt. Washington is higher than Mt. Mitchell.               True  False  X
 5. The Catskill Mountains are in Maine.                      True  False  ✓
 6. The Cascade Mountains are nearer the Pacific Ocean
    than the Rocky Mountains.                                 True  False  X
 7. The Rocky Mountains are nearer the Pacific Ocean
    than the Appalachian Mountains.                           True  False  ✓
 8. The Blue Ridge is in the Rocky Mountains.                 True  False  ✓
 9. There are more active volcanoes in the west than
    in the east.                                              True  False  ✓
10. "Old Faithful" is the name of a cyclone which sweeps
    upward from Texas into Oklahoma.                          True  False  X
11. The "Grand Canyon" was cut through the Cumberland
    Plateau by the Susquehanna River.                         True  False  ✓
12. Pike's Peak is in the Rocky Mountains.                    True  False  ✓
13. The Mississippi River flows into the Great Lakes.         True  False  ✓
14. All the following are tributaries of the Mississippi
    River: Arkansas, Missouri, Ohio.                          True  False  ✓
15. The Big Sandy is the biggest river in the United
    States.                                                   True  False  X
16. The Atlantic Ocean is to the east and the Pacific
    Ocean to the west.                                        True  False  0
17. Canada is to the south and the Gulf of Mexico to
    the north.                                                True  False  ✓
18. The great lakes are five in number.                       True  False  ✓
19. It is easier to sink while swimming in the largest
    lake east than in the largest west of the
    Mississippi.                                              True  False  ✓
20. The central portion of the United States is on the
    whole more level than the eastern or western
    portion.                                                  True  False  ✓

How to Compute Score for True-False Examination. —

Number of correct underlinings = 14
Number of incorrect underlinings = 5
Number of omissions = 1

(A) Pupil's score = number correct - number wrong.
Pupil's score = 14 - 5 = 9

Let us consider first the reason for expressing a pupil's 
score as the number correct minus the number wrong. 
Imagine a pupil who is absolutely innocent of any knowl- 
edge of the physical features of the United States. Were 
such a pupil to take the above test and were he to mark 
every statement he would according to the theory of chance 
mark ten statements correctly and ten incorrectly. The 
chances of his guessing right or wrong are fifty-fifty or one 
to one. His score on the above test would be : 

Score = 10 - 10 = 0

In short, the pupil's knowledge is zero and the method of 
computing his score gives him zero. Suppose instead that 
he knows ten statements and guesses at the other ten. Of 
the ten guessed at he would, according to chance, get five 
correct and five wrong. That is, even though his real 
knowledge is ten he will show fifteen correct (10 + 5) and 
five incorrect. The method of computing his score brings 
out his real knowledge. 

Score = 15 - 5 = 10




A pupil who marks every statement correctly makes a per- 
fect score, viz.: 

Score = 20 - 0 = 20

Observe that no account is taken of omissions. Only the 
corrects and incorrects figure in the pupil's score. When 
the time allowed the pupils to take the test is made short in 
order to test each pupil's speed of work there will, of course, 
be many papers showing several omissions each. In all 
such cases omissions should be ignored, just as we have 
done above, in computing scores. Even when the time 
allowed for the test is ample for each pupil to mark every 
statement, there will still be an occasional instance of omis- 
sion due to carelessness or misunderstanding of instructions 
or a puritanic conscience against increasing the score by 
gamble guess-work even when the instructions urge guessing. 
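
Formula (A) and the reasoning behind it can be put into a few lines of code. This is a minimal sketch; the function names `score_a` and `expected_marks` are invented here for illustration, and the second function simply assumes, as the text does, that every guess is right or wrong with equal chance.

```python
def score_a(right, wrong, omitted=0):
    """Formula (A): number correct minus number wrong. Omissions are ignored."""
    return right - wrong


def expected_marks(total, known):
    """Expected right and wrong marks for a pupil who really knows `known`
    statements and guesses at all the rest (assumed fifty-fifty)."""
    guessed = total - known
    return known + guessed / 2, guessed / 2


sample = score_a(right=14, wrong=5, omitted=1)   # the sample paper above
right, wrong = expected_marks(total=20, known=10)
recovered = score_a(right, wrong)                # comes back to the real knowledge
```

Whatever the pupil's real knowledge, the guessed half that falls wrong cancels the guessed half that falls right, which is why the score returns the number actually known.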

When the time is ample for even the slowest pupils and 
when all are instructed to mark every statement, it is much 
more convenient to compute a pupil's score according to the 
following formula: 
(B) Score = (number of statements) - 2 (number marked incorrectly)
If there are twenty statements in the test and if five are 
marked incorrectly, 

Score = (20) - 2 (5) = 10

Formula (A) gives the same results

Score = 15 - 5 = 10

That both formulae give identical results provided there are
no omissions may be shown viz.:

Let T = total number of statements, R = number right,
and W = number wrong.
Then

Score = R - W (Formula A)
R + W = T
R = T - W

Substituting in Formula A

Score = T - W - W = T - 2W (Formula B)

Formula (A) is basic and should be used when there are 
omissions. Formula (B) should be preferred when there 
are no omissions or when they are present only in negligible 
amount. Formula (B) is much more convenient. The first 
number is always the same and since the second number is 
the total statements marked incorrectly, it is only necessary 
to score and total the errors. 
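
The agreement of the two formulae, and the reason Formula (A) must be kept when there are omissions, can be checked by brute force. The sketch below is illustrative only; `score_a` and `score_b` are names invented here for Formulas (A) and (B).

```python
def score_a(right, wrong):
    """Formula (A): number right minus number wrong."""
    return right - wrong


def score_b(total, wrong):
    """Formula (B): total statements minus twice the number wrong."""
    return total - 2 * wrong


TOTAL = 20

# With no omissions, right + wrong = TOTAL, and the formulae always agree.
agree_without_omissions = all(
    score_a(TOTAL - wrong, wrong) == score_b(TOTAL, wrong)
    for wrong in range(TOTAL + 1)
)

# With omissions they part company: each omission adds one point under (B).
sample_a = score_a(right=14, wrong=5)   # the sample paper, one omission
sample_b = score_b(TOTAL, wrong=5)
```

On the sample paper, with its single omission, Formula (B) would credit the pupil with one point more than Formula (A).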

It is very difficult for some people to believe that such a 
test as has been outlined above does anything more than 
give the highest score to the luckiest guesser. They look 
with the eye of suspicion upon this thing we call chance. I 
once tossed pennies for heads or tails 50,000 times. The 
results came out 25,000 heads and 24,999 tails. Had there 
not been a miscount somewhere the two would doubtless 
have come out exactly even. I had occasion to watch two 
summer-school teachers in that nerve-racking game of 
chance called matching pennies. They matched for several 
minutes daily. The last heard they were still matching 
pennies and chance had prevented either from getting com- 
plete possession of the other's 100 pennies. Chance is fatally 
exact when the pennies or the statements in the test are 
numerous. The opportunities for injustice in score multiply 
in proportion as the number of statements is reduced. 
Hence there should be as many statements in the test as 
practical limitations will permit. 
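
The claim that injustice shrinks as statements are added can be examined exactly rather than by tossing pennies. The sketch below is an illustration, not part of the original procedure: it assumes a pupil who guesses at every statement with an even chance of being right, and the function names are invented here. It computes the exact distribution of his right-minus-wrong score.

```python
from math import comb, sqrt


def guesser_score_distribution(n):
    """Exact probability of each right-minus-wrong score when all n
    statements are marked by pure guessing (each mark right with chance 1/2)."""
    return {2 * r - n: comb(n, r) / 2 ** n for r in range(n + 1)}


def mean_and_sd(n):
    """Mean and standard deviation of the pure guesser's score."""
    dist = guesser_score_distribution(n)
    mean = sum(score * p for score, p in dist.items())
    var = sum((score - mean) ** 2 * p for score, p in dist.items())
    return mean, sqrt(var)


for n in (10, 20, 100):
    mean, sd = mean_and_sd(n)
    # Print the test length, the average score, its spread, and the spread
    # as a fraction of the perfect score n.
    print(n, round(mean, 6), round(sd, 3), round(sd / n, 3))
```

The average score of the guesser is zero at every length, while the spread of his score grows only as the square root of the number of statements. Measured against the perfect score, therefore, the part played by luck falls steadily as the test is lengthened.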

Detailed Construction of True-False Examination. — 
There are a few suggestions which will help teachers in con- 
structing the True-False test. In the first place the teacher 
should so construct the test that it will contain approximately 
the same number of true and false statements. A clever 
pupil may get a higher score than he deserves if he dis- 
covers there are many more true statements than false 
statements in the test or vice versa. Suppose there are many 
more true statements than false statements and suppose 
some pupil discovers this by observing the statements that 
he knows, or by observing the teacher's bias for writing
true statements instead of false ones. Naturally when he 
does not know what to mark he will mark True, thereby 
securing a larger score than his ability justifies. Probably 
it is by just such utilization of the errors of others that the 
intelligent get through life so much more smoothly than 
the stupid. On the other hand, the teacher should not have 
exactly the same number of true and false statements each 
time, because this will invite clever pupils to count back to 
see how many more true statements have been marked than 
false statements. Sometimes there should be more true 
statements, sometimes more false statements, sometimes the 
same number of each. Any regularity of plan should be 
carefully avoided. An English admiral complimented the 
skill of German submarine commanders by saying they were 
masters of irregularity. All the true statements should not 
come first, neither should the true and false statements be 
alternated as a regular plan. Let chance determine what 
shall be true and what shall be false and in what order the 
true and the false shall come. 
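
One way to let chance decide is to draw the key before writing the statements. The following is a minimal sketch using Python's standard random module; the function name and the seed value are assumptions made for illustration.

```python
import random


def draw_answer_key(n_statements: int, seed=None) -> list:
    """Decide by a coin toss whether each statement in turn is to be true or false."""
    rng = random.Random(seed)  # a fixed seed reproduces the same key
    return [rng.choice([True, False]) for _ in range(n_statements)]


key = draw_answer_key(20, seed=1922)
print(key)
print(sum(key), "true statements,", len(key) - sum(key), "false statements")
```

The teacher would then word statement 1 as true or false according to the first entry, and so on. Because each entry is drawn independently, the number of true statements varies from test to test and no regular pattern of order arises.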

Second, the teacher should be careful to keep out of the 
test all ambiguous statements. Statement number 18 in the
sample test is somewhat ambiguous. It says: "The great 
lakes are five in number." Since great lakes is not capi- 
talized a pupil might very legitimately interpret this to 
include the Great Salt Lake and others. It will later be 
difficult to satisfy this pupil that his score should suffer be- 
cause of the construction he gave this sentence. If the 
teacher will study her mistakes in this respect she will soon
learn how to reduce such ambiguities. As any teacher can
testify, the danger of ambiguities of wording is not
peculiar to this test. This type of test does not, however, 
give a pupil an opportunity to reveal just what interpreta- 
tion he places upon each statement unless the teacher follows 
the procedure of having pupils score their own or each 
other's paper. Self-scoring will reveal all cases of ambiguity. 
Statements which are particularly flagrant in this respect 
can be omitted in scoring. 




Third, the teacher should inspect not only this but any 
sort of test from the point of view of just what the test 
measures. Statement 19 in the sample test illustrates this 
point. The purpose is to test whether the pupil knows that 
the largest lake west of the Mississippi River contains more 
salt than the largest lake east of the Mississippi. Instead of 
measuring this I may be testing whether a pupil knows that 
it is easier to sink in fresh water than in salt water. Com- 
plex wording, unfamiliar terms, the use of negatives, all tend 
to make the test a linguistic one. Simple, brief statements 
without negatives are best. 

Fourth, the teacher may so construct the examination as 
to force pupils to guess wrong due to the power of suggestion.
This probably explains why statement 15 was marked
wrongly. The pupil doubtless argued to himself that since 
the river is named the Big Sandy it probably is the biggest 
river in the United States. The influence of having many 
suggestive statements in the test is to make the examination 
more difficult. It operates to give to the pupil who knows 
nothing at all in the test a large negative score instead of 
a zero score and it penalizes rather heavily the pupil who 
does much guessing, for every time he allows himself to be 
suggested in the wrong direction a point is subtracted from 
the score he has already made by what knowledge he has. 
In other words, the suggestive statements make the gap 
between those who know much and those who know little 
wider than it otherwise would be. Whether a pupil should 
be specially penalized for yielding to suggestion is an 
arguable question. There may be situations where it is 
eminently desirable to determine whether pupils know what 
they know so well as to be able to resist suggestion. In 
general, however, it is best to avoid suggestive statements. 
The ideal should be to construct the examination so that 
any pupil who knows absolutely nothing about the test will 
make a score of zero. 

In sum, the examination should harmonize with the fol- 
lowing suggestions: 




1. Have approximately the same number of true and 
false statements and have them arranged in chance order. 

2. Avoid ambiguous statements. 

3. Avoid suggestive statements. 

4. Avoid trivial statements lest they induce wrong habits 
of study. 

5. Avoid the use of negatives. 

6. Make the statements brief. 

7. See that one statement does not answer a preceding 
one. 

Methods of Applying True-False Test. — So much for 
the construction of the examination. How shall it be 
applied? The best way is to print, mimeograph, or other- 
wise duplicate, the examination, and place a copy in the 
hands of each pupil. But there are numerous schools which 
lack duplicating machines. For teachers in these schools 
some other means of applying the test must be found. Any 
one of the following methods may be used. First, the entire 
test may be copied word for word by the pupils and then 
marked. This is tedious and time-consuming. Second, the 
entire test may be written on the blackboard by the teacher. 
Each pupil could number a blank page of paper to corre- 
spond to the numbered statements, and then write True or 
False after the appropriate numbers. The only objection 
to this suggestion is the inconvenience of writing all the 
statements on the blackboard. Third, the pupils may be 
asked to copy on blank paper, 1, 2, 3 and so on, according
to the number of statements. The teacher can then read
orally statement 1 and instruct the pupils to make a check
after the number 1 on their paper if the statement is true,
but to make a cross if the statement is false. This is easily 
the most convenient way to give the examination. The 
chief objection to this final method is the difficulty some 
pupils have in apprehending statements presented orally, 
particularly if they are long and complicated. When the 
statement is presented visually the pupil has an opportunity 
to go back to it enough times to exhaust his possibility of
understanding it. By one or another of these methods it is
possible for any teacher anywhere to make use of this type
of examination. 

Scoring of True-False Examination.— How shall the 
True-False examinations be scored? If a copy of the test 
has been placed in the hands of each pupil, the teacher can 
take an unused test sheet, fill it out correctly, lay the correct 
column of answers beside each pupil's column of answers, 
and quickly mark whether the pupil's answers are correct 
or incorrect. If a copy of the test has not been placed in 
the hands of each pupil, but each has instead written True 
or False, or made a check or cross after the number of each 
statement, the teacher can take a page of paper similar to 
that on which each pupil has indicated his answers, copy 
the numbers just as they are and just as they are spaced on 
the pupils' papers, write after each number the correct 
answers to the statement of that number, place this column 
of correct answers beside the column of pupil answers and 
mark those which are correct and incorrect. This last scor- 
ing method presupposes that pupils have used ruled paper, 
and that each has written his numbers in a vertical column 
according to a particular spacing recommended by the 
teacher. Last and best, each pupil can score his own or his 
neighbor's paper. It is better for him to score his own. 

If the method of pupil scoring is adopted, the teacher 
should read the correct answers while the pupil checks his 
own. If the pupil does not have a copy of the statements 
before him, the teacher should read each statement before 
giving the correct answers, in order that the pupil may know 
what statements he got correct or incorrect. When all the 
pupils' answers have been marked and when all their scores 
have been computed and recorded on their examination 
paper, the teacher should ask all the pupils who missed statement
number 1 to hold up their hands, and then all pupils
who missed number 2 to hold up their hands, and so on. 
The teacher should make a record of the number of pupils 
missing each statement, and then collect all papers. 
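
Where the papers themselves are at hand, the same record can be tallied directly instead of by show of hands. The sketch below is illustrative only: the answer key, the pupil labels, and their answers are invented.

```python
from collections import Counter


def misses_per_statement(answer_key, papers):
    """Count, for each statement number (starting at 1), the pupils who missed it."""
    misses = Counter()
    for answers in papers.values():
        for number, (given, correct) in enumerate(zip(answers, answer_key), start=1):
            if given != correct:
                misses[number] += 1
    # Report every statement, including those nobody missed.
    return {number: misses[number] for number in range(1, len(answer_key) + 1)}


# Hypothetical four-statement test taken by two pupils.
key = [True, False, True, True]
papers = {
    "Pupil A": [True, True, True, False],
    "Pupil B": [True, False, False, False],
}
print(misses_per_statement(key, papers))  # {1: 0, 2: 1, 3: 1, 4: 2}
```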




But pupils will cheat. To be sure some will cheat. It 
will advantage us nothing to delude ourselves into the belief 
that cheating will not occur. To do so would be to join the 
peerage of the ostrich that is fallaciously reported to stick 
its head into the sand and think itself safe, or of the 
partridge which dives into a snow bank and feels as secure 
of its safety as the hunter feels of his game. It would 
advantage us still less to compel honesty by so arranging 
all educational situations that there is no opportunity to be 
dishonest. The chances that the world will be so tender of 
a pupil's weakness are very few indeed. If a pupil has it 
in him to be dishonest, it is a genuine kindness for the 
teacher to find it out. The benevolent birch removes less 
epidermis than the rod of the law. 

However determined, the scores for the pupils may be left 
either in their original form or they may be scaled. A later 
chapter describes a very simple method of converting crude 
scores on an examination into scale scores in terms of a com- 
mon basic unit called T. The few moments consumed in 
making this transmutation are more than repaid by having 
records for each child which are comparable from examina- 
tion to examination. 

Advantages of True-False Examination. — But why 
should this sort of examination be used at all? Wherein is 
it better than the examination method in common use? In 
the first place, the True-False examination permits a teacher 
to cover a wider field of subject matter or a wider range 
of ability per unit of time. It may be made more representa- 
tive of the total field of the pupils' study and hence be a 
fairer measure of the pupil. In the case of the traditional 
examination the teacher is forced to select a very small 
number of questions. When we were students almost as 
much of our ingenuity went into divining the kind of ex- 
amination questions the teacher would ask as in reviewing 
for examination. Now that we are teachers we have no 
reason to suppose that this practice has ceased. 

The use of this type of examination is likely to improve
the relation between teacher and pupils. The traditional
examination endangers a pleasant relationship because pupils 
more or less justly suspect that the score they make depends 
almost as much upon their conduct as upon their product. 
The proposed test convinces a pupil that the score he gets 
is the score he deserves. Such a conviction is a real event 
in a pupil's educational career. 

The True-False examination is more enjoyed by the 
pupils. The pupils enjoy it more because it offers an oppor- 
tunity for a contest where the rules are fair, and because 
it offers them a chance for a large degree of participation 
in the examination. It is agonizing for a pupil to describe 
at great length a knowledge which he does not possess in 
hopes that his command of English will camouflage his lack 
of information. Here is a question which was asked in a 
recent examination in educational measurement. "Which
three of the tests described by Whipple do you think would 
be of most service in an elementary school, if your school 
had a school psychologist to apply them?" Consider the 
perspiration it must have cost a student to perpetrate this 
answer: 

"The tests described by Whipple embraced most of the 
difficulties that would be embraced in problems of classroom 
instruction. I think his tests embrace a great variety of 
methods of approach and it seems difficult for me to think 
of just three to whom the presence of a psychologist in a 
school would give help. I would think it would be the tests 
in which knowledge of the workings of a child's mind and 
its growth and development would be most apparent since 
those not particularly trained might focus on others not of 
this kind. I fear it would be unwise to specifically mention 
just three when the number is so great which would fulfill 
all these requirements. Every teacher to be a psychologist 
would help all classroom measurement work of whatever 
kind greatly, I know; since we cannot know of the influence 
of a test upon any group except by the mental reaction 
produced." 




The True-False examination is more enjoyed by the 
teacher. The scoring is easy, rapid, and automatic when 
she does the scoring, and far more rapid when the pupils do 
the scoring. The pupils cannot well assist in scoring the 
traditional examination, and for the teacher to score forty 
verbose examination papers is time-consuming drudgery. 
Every moment of the time while scoring, the teacher must 
be profoundly concentrating upon what she is reading, for 
much of the time she must be separating the chaff from the 
wheat where the chaff is cleverly painted to look like wheat. 
And along with this is a continual emotional strain caused 
by her resistance to the temptation to underscore some and 
overscore others. 

The True-False examination is more educative for the 
pupils. The proposition that pupil scoring will relieve the 
teacher of much obnoxious drudgery, does not justify the 
inference frequently made that what is non-educative 
drudgery for the teacher will also be non-educative drudgery 
for the pupils. On the contrary the most favorable teaching 
opportunity that ever comes to a teacher is the period imme- 
diately following an examination. The pupil's interest to 
know what parts of the examination he missed and what 
he got correct is then at white heat. Witness the interested 
discussion among pupils immediately following an examina- 
tion. It is inexcusable neglect of an educational opportunity 
not to capitalize these precious moments for correcting 
erroneous ideas, clinching right ideas, and filling up mental 
spaces where ideas are not. These values can best be real- 
ized by having each pupil score his own paper and by stop- 
ping to discuss points where pupils have trouble. Of course 
not every correct answer indicates knowledge, but the pupil 
himself usually knows when he knows. This examination 
is also more educative, because it is likely to be given more 
frequently. The experience of Kirby, Courtis and others 
with practice tests shows that a pupil learns more during 
testing periods than during teaching periods. We really 
teach when we test. This examination, covering as it can a
wide range, is an ideal method of review. It reveals to the
pupils just where their difficulties lie. Testing is one of the 
best ways of teaching. 

The True-False examination gives the teacher a fuller 
knowledge of conditions. The educative value of testing 
is so great that testing should be much more frequent than 
is now the case. Now that a method of testing is available 
which involves no drudgery to anyone, testing is likely to 
become more frequent, and this means more complete and 
timely information about the abilities and difficulties of the 
various pupils, and about the successes and failures of teach- 
ing efforts. It has already been suggested that the teacher 
keep a record of the number or per cent of pupils missing 
each statement in the examination. This record will show 
what things have been well learned or poorly learned and 
well taught or poorly taught. Also it is a good thing for a 
teacher to check her own efficiency in general. This can 
be done by finding the average of the scores of all the 
pupils and by comparing this average with the total number 
of statements in the examination or at least the total num- 
ber of facts the teacher has really attempted to teach the 
pupils. If the average score is 20 out of a possible 40, the 
teacher's efficiency is 50%. Most teachers will be chagrined 
to find, if they use truly representative statements in their 
examination, that their efficiency is below 50%. Similarly, 
a pupil's efficiency may be determined by the per cent of 
statements he got correct out of the total number of state- 
ments the teacher has a right to expect him to get. Before 
the examination is given the teacher should decide what 
statements she has a right to expect the pupils to get cor- 
rect. This same number should then be used for computing 
both pupil and teacher efficiency. 
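
The two efficiency figures reduce to simple percentages. The sketch below follows the definitions in the paragraph above; the function names and the sample scores are assumptions made for illustration.

```python
def teacher_efficiency(scores, expected_total):
    """Mean class score as a per cent of the statements pupils were expected to get."""
    mean_score = sum(scores) / len(scores)
    return 100 * mean_score / expected_total


def pupil_efficiency(number_correct, expected_total):
    """Per cent of the expected statements that one pupil got correct."""
    return 100 * number_correct / expected_total


# Hypothetical class of three on a 40-statement test; the average score is 20.
print(teacher_efficiency([10, 20, 30], 40))  # 50.0
print(pupil_efficiency(30, 40))              # 75.0
```

The same expected_total is passed to both functions, as the text requires.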

Finally, the True-False examination is a genuine honesty 
test, and shows the beginnings of a technique for measuring 
in satisfactory fashion this valuable character trait. Occa- 
sional and unannounced rescoring of each pupil's paper by 
his neighbor will catch the persistent cheat. It is better
that he be discovered in school than in court. His discipline
can usually be left to his fellow pupils, over whom he was 
attempting to gain an advantage dishonestly. 

There has been listed what appear to be the chief advan- 
tages of this type of examination or any other examination 
which is similarly objective. Most of these claims rest 
upon logical probability and a limited experience and not 
upon experimental data. This last is needed and will follow 
in time. 

There are some limitations which have not been discussed. 
It is claimed first that this examination does not require 
the pupil to demonstrate a power to organize his materials. 
This is true in the sense that the pupil does not describe in 
writing a complicated mental organization but a statement 
can be so worded as to require an exceedingly complex 
mental organization before a correct answer can be unfail- 
ingly given. Consider the mental organization that must 
precede a correct answer to this simple statement: "If the
trade winds blew east Peru would have luxuriant flora." 
If it is desired to test a pupil's power to word his thought a 
composition test may be given. 

Again, it is claimed that this examination can test knowl- 
edge but not skill, knowledge but not the ability to do. 
Even skills can be tested by this examination. To reason 
that trade winds blowing east would be warm, would absorb 
moisture from the Pacific, would become chilled in passing 
over the Andes, would consequently deposit a heavy rainfall 
for Peru, which taken in conjunction with the equatorial 
climate would produce a luxuriant flora, is one sort of skill 
which this examination will test. Mathematical skills and 
the like which are too complicated to describe may be tested
in at least two ways, though there are better ways. An 
example or problem may be stated together with an answer. 
The pupil's task would be to determine by working the 
problem whether the answer given is true or false. Or instead,
the teacher can work the problem on the blackboard
for all the pupils and have them indicate whether her process
was correct or incorrect. 

Finally, it is claimed that the teacher needs to know why 
a pupil is unable to answer the question about Peru and its 
flora. The True-False examination does not show just where 
the pupil's reasoning process went wrong or stopped alto- 
gether. It is not diagnostic. This criticism has some force. 
An examination should be as diagnostic as possible. If a 
teacher wished to know where the pupil's process broke 
down she could give a subsequent, more detailed examina- 
tion of this type. The statement, "The trade winds are 
warm winds," or "Warm winds have a larger capacity for 
water than cool winds," etc., would reveal whether the pupils 
were acquainted with the basic principles, facts and the like 
necessary to reason out the correct answer to: "If the trade 
winds blew east Peru would have a luxuriant flora." 

The traditional examination has a certain advantage which 
will doubtless continue its existence. The True-False 
examination is but a herald of the newer and better types of 
examination to be. But even now we have in the True-False 
examination one that may be used by any teacher anywhere 
to great advantage. 

Such informal examinations are extremely helpful in the 
teaching of reading. There follows an illustrative applica- 
tion. 

Divide the Pupils of the Fourth-Grade Class (Table 8) 
into Two Groups of Equal Reading Ability. — Motivation 
can be increased by dividing the pupils in the illustrative 
fourth-grade class into two groups of equal ability in 
reading. This division should be made on the basis of the 
scores made in the initial test or tests. A division made 
on the basis of scores on the Thorndike-McCall Reading 
Scale will prove sufficiently accurate for the purpose. 

To make such a division arrange the initial scores on the 
reading scale in order of size, i. e., largest score first, second 
largest score second, and so on. Put the ablest pupil into
Group I, the second ablest into Group II, the third ablest
into Group II, the fourth ablest into Group I, the fifth ablest 
into Group I, the sixth ablest into Group II, and so on for 
the other pupils. This will give two groups of practically 
identical ability. 
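
The assignment rule just described (I, II, II, I, and repeat) can be sketched as follows. The pupil labels and scores are hypothetical, and the function name is an invention for this example.

```python
def split_into_two_groups(scores):
    """Rank pupils from highest score down and deal them out I, II, II, I, I, II, ..."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    group_1, group_2 = [], []
    for position, pupil in enumerate(ranked):
        # In each block of four, the first and fourth go to Group I.
        if position % 4 in (0, 3):
            group_1.append(pupil)
        else:
            group_2.append(pupil)
    return group_1, group_2


# Hypothetical initial reading-scale scores for eight pupils.
scores = {"a": 60, "b": 55, "c": 50, "d": 45, "e": 40, "f": 35, "g": 30, "h": 25}
group_1, group_2 = split_into_two_groups(scores)
print(group_1, sum(scores[p] for p in group_1))  # ['a', 'd', 'e', 'h'] 170
print(group_2, sum(scores[p] for p in group_2))  # ['b', 'c', 'f', 'g'] 170
```

Reversing the order of assignment within each pair is what keeps the group totals close; simple alternation would always favor Group I.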

The two groups so formed should each be encouraged to 
select a captain, to decide upon a name for the team, to make 
or select a team motto, and to do anything else which the 
ingenuity of the teacher or the initiative of the pupils can 
originate to increase interest in the competition. The team 
organization may be used to motivate spelling bees, compo- 
sition work and the like. 

Give an Informal Reading Test at the End of Each 
Week. — After each week of instruction, the teacher 
should give a test which she herself has constructed from 
reading material in the regular class reader. Occasionally it 
would be well to use instead a selection from the text-book 
in history, geography or the like. 

This test should be in the nature of a contest between the 
two groups in the class. In order that the pupils may not 
sacrifice speed of reading to comprehension of what is read 
and vice versa, the weekly test should measure both speed 
and comprehension. The following discussion describes a 
method of constructing such a test which is simple and ex- 
peditious, and results in a type of test which saves the 
teacher the labor of scoring papers, and which is interesting 
and educative to the children. 

The first step is to select from the reader two pages which 
are fairly representative of the other pages of the book, 
which are, if possible, unbroken by pictures, and which have 
been previously read by the pupils, though this last is not 
absolutely essential. Assume that pages 10 and 11 of the
Fourth Reader would be a satisfactory selection. 

After making this selection the next step is to formulate 
twenty true-false or yes-no questions based upon the content 
of pages 10 and 11. How to construct such a test has just
been described. 




The next step is to give the following directions to the 
pupils: 

Take out a pencil and a sheet of paper. . . . Write your name near the 
top of the sheet. . . . Now take out your reader and turn to page 9. . . .
I shall read aloud the last paragraph on page 9. Follow me as I read.
Just as soon as I read the last word, turn over the page and continue to
read silently without help. Read as fast as you can get the thought.
When you have finished reading I shall ask you some questions to see
how much you can remember of what you have read. Read both pages
10 and 11. Just as soon as you read the last word on page 11 close your
book, look at the blackboard, and copy on your paper the number you 
see there. 

Regulate the speed of reading so that the last word on 
page 9 will be read just as the minute hand of the watch is 
at some ten-seconds point. At the expiration of the first ten 
seconds write the number 10 on the blackboard. At the 
expiration of 20 seconds erase the 10 and write 20. Con- 
tinue similarly until all the pupils have finished. 

When all have finished and closed their books give the 
pupils the following directions: 

Write the numbers 1 to 20 inclusive down the left hand margin of your
blank paper. . . . I shall ask you some questions about what you have
just read. Each question should be answered with a yes or a no. If you
do not know the answer to any question guess at it. If the answer to
the first question is yes, write yes after number 1. If it is no, write no
after number 1. Treat the other questions in the same way.

If the pupils are unable to write they may be instructed 
to make a check mark when the answer is yes and a cross 
mark when the answer is no. Very young pupils will need 
a preliminary test the sole purpose of which is to teach them 
how to take the test. 

Read each question aloud, slowly and distinctly, giving 
each pupil enough time to write his answer. 

When the pupils have finished this test from memory have 
them turn their sheets over and write the numbers 1 to 20
again. Then have them open their books to pages 10 and
11. Read the questions once more to see how many questions
the pupils can answer with their books open.

Call the correct answers, once for each test. Have the
pupils score themselves or each other's paper or both and
compute their own scores. Have pupils mark answers that 
are wrong. Omitted questions are counted as wrong. The 
score on each test for each pupil is found by subtracting 
two times the number of errors he makes from the total 
number of questions. Thus if pupil A had six mistakes on 
the first part and three on the second part of the test, his 
comprehension scores would be: 

Memory comprehension score = 20 - 2(6) = 8
Visual comprehension score = 20 - 2(3) = 14

Next, tell the pupils the number of words on the two pages 
and have each pupil compute the number of words read per 
minute. This is the pupil's speed-of-reading score.
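
The three scores for one pupil can be worked as below. The comprehension figures are those of pupil A in the text; the word count of 450 and the blackboard number 180 are hypothetical values chosen for the illustration, as are the function names.

```python
def comprehension_score(n_questions: int, n_errors: int) -> int:
    """Score = questions - 2 x errors, with omitted questions counted as errors."""
    return n_questions - 2 * n_errors


def words_per_minute(n_words: int, seconds: int) -> float:
    """Speed-of-reading score from the word count and the number copied from the board."""
    return n_words * 60 / seconds


# Pupil A from the text: six errors from memory, three with the book open.
print(comprehension_score(20, 6))   # 8
print(comprehension_score(20, 3))   # 14

# Hypothetical: the two pages hold 450 words and the pupil copied the number 180.
print(words_per_minute(450, 180))   # 150.0
```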

Have each pupil preserve a record of his own scores from 
week to week in order that he may see whether he is im- 
proving in his speed of reading in particular. This measure 
of improvement will be necessarily crude because the 
material of the different tests will vary somewhat in diffi- 
culty from time to time. 

Compute the mean memory-comprehension score, the 
mean visual-comprehension score and the mean speed-of- 
reading score. The teacher should keep a record of these 
three class scores and observe whether pupils are making 
progress under her instruction. 

The scores made by a small fourth-grade class on such an 
informal silent reading test are shown in Table 13. 

How to Utilize Results.— The scores of Table 13 sug- 
gest the following useful conclusions: 

1. Pupils A, E, and H are poor readers. Pupil E could
do nothing whatever. The three are deficient in every re- 
spect due to insufficient training, or improper training, or 
lack of native intelligence or some other cause. Whether 
it is due to insufficient training can be determined by watch- 
ing the progress when further training is applied or 
by enquiring concerning the child's educational history. 
Whether it is due to improper training can be determined 
by measuring progress after giving proper training. 



TABLE 13

Scores Made by a Fourth-Grade Class on an Informal Silent Reading Test

Pupil     Memory           Visual           Speed
          Comprehension    Comprehension

A          2                6               125
B          5                8               127
C          7               12               150
D          5               16               121
E          0                0                 0

F          4               18               200
G          7               10               150
H          0                2               100
I         13               16               131
J          6               12               146

K          8               10               150
L         18               20               180
M         15               16               172
N         10               12               146
O          9               12               150

P          4                8               104
Q         10               14               140
R          7               12               140
S         12               14               166
T          6               10               127

Mean       7.4             11.4             136.3

Whether it is due to lack of sufficient native intelligence is
best determined by means of an intelligence test. 

2. Pupil D has an exceptionally poor memory. That he 
can comprehend what he reads is shown by his high score at 
visual comprehension. The slow rate at which he reads 
indicates that his poor showing was not due to hasty reading. 

3. Pupil F makes an exceedingly low score at memory 
comprehension, considering his very high record at visual 
comprehension. In all probability this is not due to a native 
deficiency of memory, but to extremely rapid reading. In 
fact, pupil F reads more rapidly than anyone else in the
class. He should be taught how to control his speed of
reading. 

4. As indicated by the memory-comprehension and 
visual-comprehension scores, pupil I is a very superior 
reader. However, he is not a particularly speedy reader. 
Above all he needs training in speed. 

5. Pupils L and M are superior readers in every respect. 
They read rapidly. They have a tenacious memory. Their 
visual comprehension, which is probably the most important 
of the three, is equally superior. 

6. Records not included in Table 13 are also useful. As 
previously recommended have those pupils who answer the 
first question correctly on the last part of the test, hold up 
their hands. Do this for each question. Questions missed 
by many in the class should be discussed. Finding the 
questions which have been missed by the class reveals the 
place where teaching is needed. 
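
The class means and some of the comparisons above can be recomputed from Table 13. The sketch below rests on two assumptions: entries that appear blank are read as zero, and the pupil listed between N and P is taken to be O. Under those readings the computed means agree with the means printed in the table. The variable and function names are illustrative.

```python
# (memory comprehension, visual comprehension, speed) for each pupil of Table 13.
# Assumption: blank entries are zeros; the pupil between N and P is labeled "O".
table_13 = {
    "A": (2, 6, 125),   "B": (5, 8, 127),    "C": (7, 12, 150),
    "D": (5, 16, 121),  "E": (0, 0, 0),      "F": (4, 18, 200),
    "G": (7, 10, 150),  "H": (0, 2, 100),    "I": (13, 16, 131),
    "J": (6, 12, 146),  "K": (8, 10, 150),   "L": (18, 20, 180),
    "M": (15, 16, 172), "N": (10, 12, 146),  "O": (9, 12, 150),
    "P": (4, 8, 104),   "Q": (10, 14, 140),  "R": (7, 12, 140),
    "S": (12, 14, 166), "T": (6, 10, 127),
}


def column_mean(table, column):
    """Mean of one score column (0 = memory, 1 = visual, 2 = speed)."""
    return sum(row[column] for row in table.values()) / len(table)


print(column_mean(table_13, 0), column_mean(table_13, 1), column_mean(table_13, 2))

# Pupils whose visual comprehension most exceeds their memory comprehension.
by_gap = sorted(table_13, key=lambda p: table_13[p][1] - table_13[p][0], reverse=True)
print(by_gap[:2])

# The most rapid reader in the class.
print(max(table_13, key=lambda p: table_13[p][2]))
```

The two pupils with the widest gap are F and D, the same two singled out in conclusions 2 and 3; their speeds then separate the hasty reader from the one whose memory is weak.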

Give an Informal Test of the Speed and Quality of Oral 
Reading. — William S. Gray has done more careful
experimentation with methods of measuring the speed and quality
of oral reading than anyone else. The teacher cannot do 
better than follow the procedure which he has evolved. 
This procedure is modified below to fit an informal test. 

He suggests that each pupil be tested individually in some 
quiet place free from distractions and where other pupils 
cannot hear what is taking place and thus profit, when their 
turn comes, by the mistakes of their predecessors. 

The teacher should hand to the pupil his regular reader 
open at a previously selected page where the child is to 
read. As she does this she should say: I want you to read
page .... aloud for me. Begin with the first word at the 
top of the page when I say Begin! Read until I tell you to 
stop. If you find some hard words, read them as best you 
can without help and continue reading. 

In case a pupil hesitates several seconds on a difficult 
word, pronounce it for him and mark it as mispronounced. 
If the first word at the top of the page is not the beginning
of a sentence, begin time and error records the instant the
pupil completes the partial sentence. 

Record the exact second when the pupil begins the first 
whole sentence and the exact second when the pupil finishes 
the page, paragraph, or whatever amount of material the
teacher has previously elected to have read. The number
of words read per minute is the pupil's speed-of-oral-reading
score. 

While the pupil is reading the teacher should carefully 
record the errors made. The following, quoted from Gray, 
illustrates the character of the errors and the method of 
recording them. 

The sun pierced into my large windows. It was the opening of October,
and the sky was of a dazzling blue. I looked out of my window and down
the street. The white houses of the long, straight street were almost painful
to the eyes. The clear atmosphere allowed full play to the sun's brightness.

[In the original this passage carries hand-drawn marks for each error; the marks are not reproduced here.]

"If a word is wholly mispronounced, underline it as in the 
case of 'atmosphere.' If a portion of a word is mispro-
nounced, mark appropriately as indicated above: 'pierced'
pronounced in two syllables, sounding long a in 'dazzling,' 
omitting the s in 'houses' or the al from 'almost,' or the r 
in 'straight.' Omitted words are marked as in the case of 
'of' and 'and'; substitutions as in the case of 'many' for
'my'; insertions as in the case of 'clear'; and repetitions as 
in the case of 'to the sun's.' Two or more words should be 
repeated to count as a repetition. 

"It is very difficult to record the exact nature of each 
error. Do this as nearly as you can. In all cases where you 
are unable to define clearly the specific character of the 
error, underline the word or portion of the word mispro- 
nounced. Be sure you put down a mark for each error. In 
case you are not sure that an error was made, give the pupil 
the benefit of the doubt. If the pupil has a slight foreign 
accent, distinguish carefully between this difficulty and real 
errors." 



The pupil's quality-of-oral-reading score is found by mul-
tiplying the number of errors made by 100 and by dividing
this product by the number of words in the passage read. A
large score should be interpreted as poor oral reading. 

The speed-of-oral-reading score for the class is the mean
of the speed-of-oral-reading scores for the individual pupils,
and the quality-of-oral-reading score for the class is the
mean of the quality-of-oral-reading scores for the individual
pupils. Both these class scores and the pupil scores should 
be preserved in order to measure the amount of growth. 
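
The arithmetic of the two oral-reading scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the three sample pupils are hypothetical and are not data from the text. Speed is taken as words per minute, and quality as errors per 100 words.

```python
from statistics import mean


def speed_score(words_read: int, seconds: float) -> float:
    """Speed-of-oral-reading score: words read per minute."""
    return words_read * 60.0 / seconds


def quality_score(errors: int, words_in_passage: int) -> float:
    """Quality-of-oral-reading score: errors times 100, divided by words.

    A larger score means poorer oral reading.
    """
    return errors * 100.0 / words_in_passage


# Hypothetical pupils: (words read, seconds taken, errors made).
records = [(120, 60, 6), (120, 90, 3), (120, 80, 12)]

speeds = [speed_score(words, secs) for words, secs, _ in records]
qualities = [quality_score(errs, words) for words, _, errs in records]

# The class scores are the means of the individual pupil scores.
class_speed = mean(speeds)
class_quality = mean(qualities)
```

Keeping both the pupil lists and the class means makes it possible to repeat the computation later in the year and compare.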

III. The Use of Standardized Scales 

8 and 9. Determine the Final Reading Age and 
Mental Age. — The final reading age for the end of the cur- 
rent year must be estimated and so must the mental age. In 
addition to its other valuable functions the Intelligence 
Quotient aids us in estimating what each pupil's reading 
age or mental age should be at the end of, say, ten months 
of instruction, for a pupil's I.Q. is a rather accurate index 
of his capacity for progress in silent reading. Reading 
Quotient will serve as a reasonably good substitute when 
I.Q. has not been determined. A pupil whose I.Q. is 90 
should be expected to progress only 90 per cent as fast or as 
far in 10 months as a pupil whose I.Q. is 100. Hence to 
estimate a pupil's final reading age all that is necessary is to 
add to his initial reading age 90 per cent of 10 months if 
the pupil's I.Q. is 90 and the school term is 10 months, or 
90 per cent of 8 months if the pupil's I.Q. is 90 and the 
school term is 8 months. Thus pupil A's final reading age is 
121 plus 100 per cent of 10 months, i. e., 131, as shown in 
line 8 of Table 8. His final mental age, computed similarly, 
is 121 plus 100 per cent of 10, i. e., 131, as shown in line 9 
of Table 8. Pupil B's final reading age is 130 plus 122 per 
cent of 10, i. e., 142. His final mental age is 150 plus 122 
per cent of 10, i. e., 162. These computations assume that 
the school term is 10 months. 
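
The estimate described above is a single line of arithmetic, sketched here in Python. The function name is hypothetical, and rounding to whole months is an assumption made so that the results agree with the figures quoted for pupils A and B.

```python
def estimated_final_age(initial_age_months: float, iq: float,
                        term_months: float = 10) -> int:
    """Initial age plus I.Q. per cent of the school term, in whole months.

    Works for either reading age or mental age, since the text
    computes both in the same way.
    """
    return round(initial_age_months + (iq / 100.0) * term_months)


# Pupil A: I.Q. 100, initial reading age 121, ten-month term.
pupil_a_reading = estimated_final_age(121, 100)

# Pupil B: I.Q. 122, initial reading age 130 and mental age 150.
pupil_b_reading = estimated_final_age(130, 122)
pupil_b_mental = estimated_final_age(150, 122)
```

For a shorter term, pass the number of months as the third argument, for example `estimated_final_age(121, 90, 8)` for a pupil of I.Q. 90 in an eight-month school.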



I.Q. will prove a reasonably satisfactory basis for esti- 
mating progress in all such complex functions as silent read- 
ing. It may not prove satisfactory for such a narrow func- 
tion as handwriting. Prophecy as to progress in hand- 
writing may have to be based upon a measurement of special 
capacity rather than general intelligence. Lacking this the 
teacher may use Handwriting Quotient or may arbitrarily 
assign objectives. 

10. Determine the Objective for the End of the 
Year. — The objective in reading for each pupil should be
either his estimated final reading age or his estimated final 
mental age. As a rule it will doubtless prove more just and 
satisfactory both to teacher and pupil to use the estimated 
final reading age as the minimum objective in preference to 
the estimated final mental age. The reading age as an 
objective requires each pupil to make normal progress for his 
capacity. If he has, in the past, exceeded normal expecta- 
tion, this objective will preserve all such gain. Mental age 
on the contrary would not preserve the gain. Furthermore, 
any pupil whose Accomplishment Quotient was below 100, 
as for example pupil B, would be required to do the difficult 
task of making up all this deficiency in a single year. 
Finally mental age would place a heavy burden upon schools 
whose term is short. Hence the estimated final reading age 
for each pupil should define the minimum objective in 
reading. 

But the estimated final reading age should be considered a 
minimum and not a maximum objective. It makes no pro- 
vision for recovering past losses on the part of pupils whose 
Accomplishment Quotients are less than 100. These pupils 
in particular should look upon their objectives as low 
minima. Reading age objectives should not be stopping 
places for any pupil no matter what his Accomplishment 
Quotient. Such objectives are meant to be passed and, if 
possible, left far in the rear. Any teacher who faithfully 
carries out the procedure here outlined should expect to 
exceed such goals. There should be minimum objectives,
but there should not be any maximum objectives except in 
the sense that other purposes and abilities of vital importance 
should not be sacrificed to attain some far-off goal in reading. 
However, it should not be forgotten that he who increases his 
ability in reading is contributing to an increase in many other 
abilities. 

The objective in terms of reading age should be trans- 
lated back into a score on the reading scale. Pupil A's esti- 
mated final reading age is 131. By consulting Table 11 it 
is found that 131 corresponds to a T score of 43. Hence 
pupil A's minimum objective is 43. Pupil B's objective is 
47. The objectives for the other pupils are shown in line 
10 of Table 8. 

The objective for the class as a whole is the mean of the 
score objectives for the individual pupils. The class objec- 
tive is then, as shown, 39.8. 

The common practice of setting up no definite visible ob- 
jective at all could not be expected to produce other than 
the current indifference toward improvement. We would 
question the intelligence of any adult who seemed to be in a 
great hurry if he did not know where he was nor where he 
was going. Without the initial measurements already rec- 
ommended, children, as Foote of Louisiana points out, prac- 
tically do not know where they are and without more definite 
objectives they do not know in any thrilling way just where 
they are to go. It may be a tribute to children's intelligence 
that they are listless and uninterested. 

The common practice of setting up definite objectives 
which are not objectives at all but impossible ideals for the 
class can only produce discouragement. Either because of 
delusions of grandeur concerning their own efficiency or 
because of an irrational confidence in their pupils, non-
technically-trained teachers and supervisors almost invari-
ably set up impossibly distant objectives. Recently a group 
of unusually progressive teachers decided to set up objectives 
in composition. After months of study sample compositions 
were selected to mark the passing point for each grade.
When these specimens were measured on a standardized com- 
position scale it was found that the specimen selected to indi- 
cate the passing point for the fifth grade was of a quality 
which twenty-five per cent of sophomore college students 
could not equal. 

The common practice of setting up a definite objective 
which is reasonable for the class as a whole, but which is 
the same for all the pupils in the class is almost equally bad. 
It violates a fundamental psychological law that pupils differ 
and differ greatly in both their initial ability and their capac-
ity to make progress. The following fable from the Rose
Garden of the Persian poet, Sa'di, the "nightingale of 
Shiraz," is still true: 

"A king handed over his son to a teacher and said, 'This is my son; 
educate him as one of thine own sons.' The preceptor spent some years
in endeavoring to teach him without success, while his own sons were made
perfect in learning and eloquence. The king took the preceptor to task,
and said, 'Thou hast acted contrary to thy agreement, and hast not been
faithful to thy promise.' He replied, 'O King! education is the same, but
capacities differ.'"

Graph the Immediate and Final Objectives. — It is 
well to have not only annual but also monthly objectives for 
each pupil and to have these graphed by the pupils them- 
selves. There are enough of the Thorndike-McCall Reading 
Scales to give one each month, consequently each pupil 
should divide the total distance he must travel during the 
year into ten equal portions if the school term is ten months, 
or five equal portions if the school term is five months. 
Pupil A's initial score is 40. His final objective is 43. He 
must gain 3 points in 10 months, i. e., three-tenths of a point 
each month. A large sheet of coordinate paper may be 
tacked upon the wall of the school room and each pupil can 
lay off his monthly objectives on this single large sheet. 

Graph the Objectives for All Classes in the School. — 
Progress in reading can be motivated even better if all the 
classes in the school are carrying out the reading program 
outlined. In this case, or even when just the adjoining 
grades participate, it would be advisable to construct a large
chart showing the immediate and final goals for each grade. 
This chart should be posted in some conspicuous place in the 
school building. The classes may then compete to see 
which one can first attain its minimum objective. The 
monthly progress made by each class should be indicated 
on this school graph. 

The particular fourth grade shown in Table 8 has an 
initial score of 36.4 and a final objective of 39.8. Hence 
for this class the school chart would graphically show the 
initial point and the subsequent monthly hurdles, thus: 
36.4 36.7 37.1 37.4 37.8 38.1 38.4 38.8 39.1 39.5 39.8 
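
The monthly hurdles are found by dividing the year's distance into equal steps. A minimal Python sketch follows; the function name is hypothetical, and rounding to one decimal place is assumed because the text quotes the class figures to one decimal.

```python
def monthly_objectives(initial: float, final: float,
                       months: int = 10) -> list[float]:
    """Initial score followed by one equally spaced objective per month."""
    step = (final - initial) / months
    return [round(initial + step * k, 1) for k in range(months + 1)]


# The illustrative fourth grade: initial 36.4, final objective 39.8.
class_hurdles = monthly_objectives(36.4, 39.8)

# Pupil A: initial 40, final objective 43, three-tenths of a point a month.
pupil_a_hurdles = monthly_objectives(40, 43)
```

For a five-month term, call `monthly_objectives(initial, final, months=5)`.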

Lest the graphic location of pupil objectives on a publicly 
posted chart humiliate any pupil who saw that his bar on the 
graph was shorter than other bars, the teacher should make 
the diagram in such a way that all bars are of equal length. 
The principal of the school should follow a similar plan in 
constructing the diagram for the contest between classes. 
The teacher and principal would know that one inch, say, 
of a bar meant, perhaps five points for one child or class and, 
possibly, eight points for some other child or class. 
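
One way to draw bars of equal length is to plot, for each pupil or class, the fraction of its own distance covered rather than the raw score. The Python sketch below is an assumption about how this might be done, not a procedure given in the text; the function name and sample scores are hypothetical.

```python
def bar_fraction(initial: float, objective: float, current: float) -> float:
    """Share of the pupil's own distance covered (1.0 = objective reached)."""
    if objective == initial:
        raise ValueError("objective must differ from the initial score")
    return (current - initial) / (objective - initial)


# A pupil with five points to gain who has gained two and a half ...
pupil_x = bar_fraction(initial=40, objective=45, current=42.5)

# ... fills the same share of the bar as one with eight points to gain
# who has gained four.
pupil_y = bar_fraction(initial=30, objective=38, current=34)
```

A value above 1.0 simply means the minimum objective has been passed.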

Visible Goals for Education. — During the World War 
an indispensable device for increasing military efficiency was
a battle map showing the present positions of the contending 
armies and clearly defining the objectives of attack. Sim- 
ilarly all great industrial corporations find it necessary to 
keep a production chart which shows the extent to which 
actual production has kept pace with the desired production. 
Education is the world's greatest manufacturing industry. 
Completely surrounded by its foes it is attacking in all direc- 
tions in the world's greatest battle. And yet education has 
neither battle map nor production chart. Its goals can only 
be made visible as objective measurement develops. 

To carry the battle analogy further, the attack of the 
school should be like the attack of the various divisions of a 
large army. The General Staff marks out a series of goals 
for each division. Certain divisions move forward and cap- 
ture their objectives before certain other divisions even leave
their trenches. When a division has attained its first objec- 
tive it is instructed to "dig in" until other divisions have 
reached their assigned goals. To move divisions forward 
without regard to the position of cooperating divisions would 
mean speedy disaster. But pupils are tough. They are 
frequently pushed limpingly toward one goal before they 
have reached a prerequisite or more valuable goal in another 
goal series. Such neglect of proper emphasis will continue 
until objective measurement has made visible both the cur- 
riculum of purposes and the curriculum of abilities as well 
as the position of the pupil with reference to the curriculum. 

This suggestion that tests be used as the regulator of edu- 
cational emphasis will be opposed by a large group of well- 
meaning educators. The humanity of Pestalozzi and the 
sympathy for childhood of the good people who have fol- 
lowed in his train could not abide the dry-as-dust drill to 
which children were subjected. The reaction away from 
the drill subjects by certain educators is more than an emo- 
tional one. It is in part due to a real change in their con- 
ceptions of what is most worth while in education. They 
desire, and rightly so, a greater emphasis upon those virtues 
which have to do with civic responsibilities and other rela- 
tionships. Since most existing tests measure drill subjects 
there is a grave fear that the widespread use of tests will 
merely increase the emphasis upon what they conceive to 
be the relatively less valuable abilities. 

The attitude of these educators is wholly honest but sub- 
stantially unwise. There is grave danger that they will use 
their ingenuity not to devise ways of tying the skills to 
children's purposes in such a way that drill will be interest- 
ing, but to undermine our conception of the tremendous value 
of these skills. Tom dreamt that he and many other chim- 
ney sweeps were locked in black coffins and that there came 
an angel with a golden key and unlocked the coffins and 
set them all free. The golden keys with which teachers 
unlock the minds of children are the basic skills. They are 
more valuable even than Virgil's golden bough for they open
the very gates of life. The skills are valueless in themselves, 
and at the same time they are the indispensable prerequisites 
of all that is valuable in education. Like the centaur's 
tunic they cannot be torn off without carrying away the flesh 
and blood of the wearer. 

Take reading, for example. Carlyle was not far wrong 
when he said that all any school can do is to teach us how 
to read. Carlyle tells how Odin was credited with the great- 
est invention man has ever made, namely, the invention of 
letters whereby man may mark down the unseen thoughts 
that are in him. He tells of the astonishment of Atahualpa 
the Peruvian king; how he made the Spanish soldier who 
was guarding him scratch dios on his thumb-nail, that he 
might try the next soldier with it and thus ascertain whether 
such a miracle were possible. Odin deserved his deifica- 
tion. 

A curriculum contains but two absolute indispensables. 
These are purposes and those particular types of abilities 
called skills. Purposes are unquestionably primary; skills
are next; information and similar types of abilities are least
important. The importance of purposes has already been 
emphasized. Those who are asking for a different emphasis 
in education are really asking for purposes. Most of the 
abilities listed as being the higher values of education, really 
are purposes. Most of the problem of what is called social-
izing pupils reduces to a problem of inculcating purposes.
Most individuals possess sufficient ability to be honest, 
courteous, sympathetic, cooperative, generous, unselfish, 
and all the rest of the Christian virtues. Their lack is not 
ability but purpose. 

Skills are a close second to purposes in worth. Skills are 
methods of work. How to manipulate numbers, how to 
write, how to compose, how to read, how to use books so as 
to find a bit of desired information, how to evaluate material, 
how to think, etc., all these are skills which match purposes 
in worth. The pupil who has learned how to work and
learn, and who is provided with a rich set of purposes may
be turned loose to educate himself.

Information is least in importance. Most of what all of 
us once learned, we have forgotten without regret. But not 
for anything would we give up our purposes and our skills 
which enable us to learn new information or relearn the old 
when needed. Many skills can only be developed by exer- 
cising them upon knowledge, facts, or information. The 
materials of information and the like selected for whetting 
the mental skills should be those which will be most service- 
able for the realization of purposes. But all the time knowl- 
edge should, from the adult's point of view, be considered 
merely the by-product of the process of developing pur- 
poses and skills. 

Instead of objective tests causing an over-emphasis upon 
the skills to the detriment of purposes, their use will insure 
to skills just that emphasis provided by the curriculum and 
will prove to be the salvation of the higher values. When in-
telligently used, tests are merely instruments for realizing the
curriculum. Like poison, steam engines, fire, or any other
potent force, they require intelligent control. We do not
trust fire to infants, and if there exist anywhere educators
who do not subordinate tests to their curriculum, they are
still in their professional infancy and should not be trusted 
to use tests. 

Tests will be the salvation of the higher values because of 
a natural human tendency to stress the tangible and visible. 
Just as a child will not put forth intense effort when he 
can see no results, so a teacher is not likely to spend much 
effort trying to develop a trait, improvement in which neither
she nor the child's parents can see. When a month's im-
provement in handwriting or composition is invisible, a year's
improvement in unselfishness, even though very important,
will scarcely tip the scales of consciousness. It is human
nature to fix our faith to form, hence so long as the average 
of human nature remains what it is, we must not expect it
to expend effort in producing invisible, unrewardable im- 
provements so long as it is permitted to produce visible 
rewardable changes. The moon pulls on the earth as well as 
on the sea but the earth tides interest few. Visibility and 
rewardability control the amount and direction of effort. 
The skills have been over-emphasized in the past, and always 
will be until we have either the thus-far-and-no-farther of
tests or an educational magnifying glass which will make vis- 
ible what has before been invisible. Even though "We are 
such stuff as dreams are made of" we simply refuse to "Pipe 
to the spirit ditties of no tone." 

Give the Standard Reading Scale at the End of Each 
Month. — The teacher should give the Thorndike-McCall
Reading Scale monthly. Pupils should be encouraged to 
look upon these monthly tests as important events. Each 
pupil and class should strive to exceed the next objective. 
After the tests have been scored, the results should be of- 
ficially graphed upon the pupil and class charts. 

When a pupil or a class fails to show progress both teacher 
and pupils should attempt to locate the cause. In a school 
where this program is now in operation the third grade, at 
the end of the first month, exceeded its goal for the end of the 
year. The fourth grade progressed three-fourths of the dis-
tance toward its goal for the end of the year. The fifth
grade made no progress whatever. The sixth and seventh 
grades made excellent progress. The eighth grade, like the 
fifth, made no progress whatever. Such startling differences 
as these call for careful investigation. Due to crude scoring
units or to the unreliability of the test, individual pupils will
appear frequently to have made no progress, but this does 
not explain the lack of progress of a whole class. 



CHAPTER V 

MEASUREMENT IN EVALUATING THE 
EFFICIENCY OF INSTRUCTION 

11, 12 and 13. Determine the Final Accomplishment
Quotient.¹ — The score on the last standard test of each
pupil should be converted into a reading age, and this actual 
final reading age should be divided by the estimated final 
mental age. The quotient is the final Accomplishment 
Quotient which, taken in conjunction with the initial Ac- 
complishment Quotient, is the final evaluation of the 
efficiency of the year's work. 

Line 11 of Table 8 shows that pupil A made a final score
of 43, which, as line 12 indicates, is equivalent to a final 
reading age of 130. Line 9 shows him to have an estimated 
final mental age of 131 months. Dividing his final reading 
age by this mental age gives him an Accomplishment Quo- 
tient of 99. Thus pupil A has substantially attained his 
objective. Pupil B's Accomplishment Quotient markedly 
improved during the year, thereby showing that he greatly 
exceeded his minimum objective. Nevertheless his achieve- 
ment is still considerably below what it should be. Pupil C 
not only maintained his high initial Accomplishment Quotient 
of 112, but pushed it 6 points higher. 
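
The quotients used here are simple ratios expressed as percentages. The Python sketch below illustrates them; the function names are hypothetical, and rounding to a whole number is assumed because the text quotes quotients as whole numbers.

```python
def intelligence_quotient(mental_age: float, chronological_age: float) -> int:
    """I.Q.: mental age divided by chronological age, times 100."""
    return round(100 * mental_age / chronological_age)


def accomplishment_quotient(reading_age: float, mental_age: float) -> int:
    """A.Q. in reading: reading age divided by mental age, times 100."""
    return round(100 * reading_age / mental_age)


# Pupil A at the end of the year: actual final reading age 130 months,
# estimated final mental age 131 months.
pupil_a_final_aq = accomplishment_quotient(130, 131)
```

An A.Q. of 100 means the pupil reads exactly as well as his mental age would lead one to expect; pupil A's figure falls just short of that.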

The mean initial reading score for the entire class was 
36.4. The reading score objective was 39.8. The actual 
final reading score made was 41.2. Thus the class consid- 
erably exceeded its minimum objective. The mean initial 
Accomplishment Quotient was 100. The final Accomplish-
ment Quotient was 103. Irrespective of what the contest
between classes revealed (results not presented here) both
the teacher and pupils of our illustrative fourth-grade class
know that they have a right to be proud of the record made.

¹ Those portions of Chapters III, IV, and V which have particular reference to
reading are brought together in somewhat modified form to make a chapter in
the Teachers' Manual which accompanies The Child's World Readers, published by
Johnson Publishing Company, Richmond, Va.

Rate the Efficiency of Teaching. — Do the results from 
standard tests given to a class reveal the efficiency of the 
teacher of that class? They do and they do not. They do 
provided certain conditions obtain. These conditions are 
roughly obtainable by an experimental control of the situa- 
tion. They do not, because the conditions necessary for a 
just evaluation of a teacher's efficiency rarely obtain in the 
ordinary uncontrolled testing situation. 

All told more harm may easily be done than good. The 
use of tests for the guidance and diagnosis of pupils is so 
much more vital than their use to evaluate teachers that the 
former value should not be lost through antagonizing teach- 
ers in order to obtain the latter. For some time to come, at 
least, tests had better be used to measure pupils and not 
teachers, except in so far as teachers measure their own 
efficiency or cooperate in its measurement. When tests have 
reached a state of development where their use will lead to a 
just evaluation, the really efficient teachers will themselves 
demand to be rated by means of tests in order to escape 
another method whose accuracy is such that educators tol- 
erate it only because nothing else has been available. 

Since, however, teachers and supervisors are both likely 
to demand in the near future that their work be evaluated in 
a more scientific and hence more impersonal manner, there 
is summarized below what I conceive to be the fundamental 
assumptions underlying a scientific procedure for rating and 
promoting teachers and supervisors as well as the steps in 
the process of making such ratings. 

1. The pupil is the center of gravity or sun of the edu- 
cational system. Teachers are satellites of this sun and 
supervisors are moons of the satellites. 

2. All the paraphernalia of education exist for just one 
purpose, to make desirable changes in pupils. 



3. The worth of these paraphernalia can be measured in 
just one way, by determining how many desirable changes 
they make in pupils. 

4. Hence the only just basis for selecting and promoting 
teachers is the changes made in pupils. 

5. Teachers are at present selected and promoted pri- 
marily on the basis of their attributes, such as intelligence, 
personality, physical appearance, voice, ability in penman- 
ship and the like. 

6. No one has demonstrated just what causal relation- 
ship, if any, exists between possession of these various attri- 
butes and desirable changes in pupils. The relation between 
possession of certain attributes and the degree of favor of a 
teacher in the inspector's eyes is more evident. Dr. Chassell 
in her Ph.D. thesis determined the correlation between cer- 
tain features of Ph.D. students of education and later suc- 
cess. She found that the score made in Ph.D. matriculation 
examinations at Teachers College correlated with success 
about .50. The quality of their Ph.D. dissertations corre- 
lated about .50. The letters of recommendation written 
about these Ph.D.'s correlated about .30. Their handwrit- 
ing correlated .20, and their photograph .10. The follow- 
ing showed substantially zero correlation with later success: 
physical defects, type of locality of birthplace, age of reach- 
ing a given academic status, study abroad, size of family, 
church relationship, reading knowledge of languages, and 
travel abroad. This study is more valuable in the present 
connection for the technique it exemplifies than for its con- 
clusions. The subjects were not typical teachers but 
Ph.D.'s. The criterion of success was not demonstrated 
changes in pupils but the opinion of judges. 

7. Scientific measurement itself is fair only when we 
measure the amount of desirable change produced in pupils 
by a given teacher. The measurement of change requires 
both initial and final tests. The plan outlined below pro- 
vides for these. 



8. Scientific measurement is fair only when we measure 
amount of change produced in a standard time. This re- 
quirement can be satisfied. 

9. Scientific measurement is fair only when we measure 
the amount of change in standard pupils. The Accomplish-
ment Quotient is included in the plan below because this is
a device for converting pupils, no matter what their intelli- 
gence, into standard pupils. 

10. Scientific measurement is fair only when the measure- 
ment is complete. Absolute completeness would require a 
measurement of the amount of changes made in children's 
purposes as well as their abilities. Absolute completeness 
is of course impossible and is in fact not necessary; partly 
because a chance sampling of the changes made will be 
thorough enough, and partly because teachers' skill in mak- 
ing desirable changes in, say, reading is probably positively 
correlated with their skill in making desirable changes in, 
say, arithmetic. 

The technique for satisfying the foregoing requirements 
and evaluating a teacher's efficiency in, say, reading follows: 

1. Determine the initial reading score of the pupils in the 
teacher's class (line 2 Table 8). 

2. Convert initial reading score into reading age (line 3 
Table 8). 

3. Determine each pupil's mental age at the same time 
the initial reading test is given (line 5 Table 8). 

4. Divide mental age by chronological age to get I.Q. 
(line 6 Table 8). 

5. Divide reading age by mental age to get A.Q. in read- 
ing (line 7 Table 8). 

6. Estimate final mental age for the end of the teaching 
period (line 9 Table 8). 

7. Determine final reading score at the end of the teach- 
ing period (line 11 Table 8). 

8. Convert final reading score into final reading age (line
12 Table 8).



9. Divide final reading age by estimated final mental age 
to get final A.Q. (line 13 Table 8). 

10. Subtract mean initial A.Q. in reading from mean 
final A.Q. in reading (line 14 Table 8). 

If the mean difference is zero the teacher has been typi-
cally efficient. If the difference is below zero the teacher
is below average in efficiency. To the extent that the mean 
difference is above zero, just to that extent the teacher has 
shown superior efficiency in teaching reading. 

11. Repeat steps 1, 2, 5, 7, 8, 9, and 10 for arithmetic
and the mean A.Q. difference will be an index of the teach-
er's efficiency with arithmetic. In similar fashion, her ef-
ficiency at teaching other measurable abilities could be deter-
mined.

12. Compute the mean of these various mean A.Q. dif- 
ferences to get a final determination of the teacher's effici- 
ency. Franzen has defined a teacher's efficiency in terms, 
then, of the following formula where N is the number of sub- 
jects tested or tests given. 

\[
\text{Teacher Efficiency} = \frac{(\text{Read. A.Q. Diff.}) + (\text{Arith. A.Q. Diff.}) + \text{etc.}}{N}
\]

In the case of our illustrative fourth-grade class the teach- 
er's efficiency was as follows: 

Teacher Efficiency = +3

This formula may be used not only for the rating of teach- 
ers but also for selecting new teachers who are given a half- 
year's or year's trial. 
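
The rating just described reduces to two averages: the mean A.Q. difference within each subject, and the mean of those differences across subjects. The Python sketch below is an illustration only. The function names, the individual pupil quotients, and the arithmetic figure are hypothetical; only the class means of 100 and 103 in reading come from the text.

```python
from statistics import mean


def mean_aq_difference(initial_aqs: list[float],
                       final_aqs: list[float]) -> float:
    """Mean final A.Q. minus mean initial A.Q. for one subject."""
    return mean(final_aqs) - mean(initial_aqs)


def teacher_efficiency(subject_differences: list[float]) -> float:
    """Mean of the A.Q. differences over the N subjects tested."""
    return mean(subject_differences)


# Hypothetical pupil quotients chosen so that the class means are
# 100 (initial) and 103 (final), as in the illustrative fourth grade.
reading_diff = mean_aq_difference([98, 100, 102], [101, 103, 105])

# With reading as the only subject tested, N is 1.
reading_only = teacher_efficiency([reading_diff])

# With a hypothetical arithmetic difference of -1, N is 2.
two_subjects = teacher_efficiency([reading_diff, -1])
```

A result of zero corresponds to typical efficiency, a negative result to efficiency below average, and a positive result to superior efficiency.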

Crude as the proposed method of selection is, it is fairer 
than present methods. The superintendent doubtless solilo- 
quizes like Caliban upon Setebos as the applicants march 
before him with a sample of painstaking penmanship in one 
hand and an antique photograph in the other. 

"Am strong myself compared to yonder crabs 
That march now from the mountain to the sea; 
Let twenty pass and stone the twenty-first,
Loving not, hating not, just choosing so. 
Say, the first straggler that boasts purple spots 
Shall join the file, one pincer twisted off; 
Say, this bruised fellow shall receive a worm, 
And two worms he whose nippers end in red; 
As it likes me each time, I do: so He." 

Rate the Efficiency of Study. — Henry is a relatively 
stupid boy but his father doesn't know it. The teacher 
doesn't know it. The teacher considers him lazy. Two 
years ago Henry was in a class with pupils of his own age. 
Owing to his low intelligence he was hopelessly outclassed 
and as a consequence was failed by his teacher. When the 
father received the report, he and Henry had a dramatic 
session in the woodshed. 

Henry repeated the work of the grade with another class 
which happened to be younger and stupider than usual. As 
a result of this fortunate combination in his competitors,
Henry's father received at the end of the year a good report 
of Henry's work. Henry's teacher is happy because she 
thinks she succeeded so much better with him than did his 
former teacher. The former teacher is happy because she 
thinks it was her courage in failing him that paved the way 
for a moral reformation. Henry's father is happy because 
he considers he knew just exactly the right stimulus to use 
to motivate Henry's study. Henry is happy because he is 
not as unhappy as he was a year before. 

This year Henry is fighting a losing battle in competition 
with those who are intellectually superior. He already sees 
that he is headed straight for another failure which does not 
worry him, and another thrashing which concerns him 
greatly. Thus every other year Henry will receive his inev- 
itable thrashing until he is strong enough to physically rebel. 

Henry is not one child but a million children in this land 
of justice. These million are yearly subjected to such in- 
justice because reports sent to parents are misleading. 

The intellectually superior children suffer as much as or 
more than stupid children, but their suffering is of a different 
type. Most gifted children are working far below their 
optimum level of efficiency simply because no one suspects 
their real possibility. Since they lead their classes without 
difficulty and hence secure all the rewards which additional 
effort would bring there is no motive to exceed their present 
rate of progress. 

How may a just report be made? The Accomplishment 
Quotient not only measures the efficiency of a whole school 
but it also yields the fairest measure of the extent to which 
a pupil has progressed in proportion to how much he was 
capable of progressing. Hence the Accomplishment Quo- 
tient is at least one of the measures which should be sent to 
parents. 

A measurement of the efficiency of pupils is likewise useful 
in conferences with parents. Consider for a moment how 
much more useful a principal could make himself if he pos- 
sessed for every pupil in his school the information shown in 
Table 2. Presented with an array of such impartial facts, 
the parent who came to scoff would remain to pray. Parents 
who came earnestly seeking means to cooperate would not 
go away empty of fruitful suggestions. 

Fortified with such information the principal would be 
equally useful in conferences with teachers and pupils. 
Given such a detailed knowledge of the conditions in the 
school and of the problems with which each teacher is con- 
tending, the principal could add to his teacher's respect for 
his superior power, a respect for his superior knowledge. 
As the situation now stands the average principal must hon- 
estly confess that the rank and file teachers know far more 
than he of the real condition of the school. Finally there 
would be innumerable instances where such information 
would enable him more intelligently to confer with pupils, to 
deal with discipline cases, and to supervise the instruction 
of individual children. 

Experimental Selection of Methods and Materials of 
Study, Instruction, and Supervision. — Here stands a 
pupil — the Alpha and Omega of all educational effort, the 
center of gravity of the educational universe. Everything 
that Midas touched turned to gold. Everything that touches 
a pupil shows whether it is gold. Teacher, supervisor, principal, 
superintendent, United States Commissioner of Education, 
materials, methods, normal schools, this book, 
educational tests, the educational philosopher who confines 
himself solely to a contemplation of the ultimate, all these 
show whether they are gold or dross by the efficiency they 
show in altering the synaptic connections of this pupil's 
neurones. If no one of the above produces any desirable 
change in the pupil they are educationally without worth. 
Educational measurement is distinctive in that it must show 
the educational efficiency of all things and then in the last 
great experiment show whether it too has or has not value. 
Thus measurement alone possesses the power of self-destruc- 
tion. And its worth like the worth of all else depends upon 
the amount and value of the changes it can produce in this 
pupil. 

If probability may be so crowded as to assume that all of 
the above have an educational value above zero, the next 
questions become: Which of two methods is more efficient? 
Which of two teachers or text-books is more efficient? 
Which of two practice tests? Which of two forms of organ- 
ization? As has previously been suggested, an absolutely 
just answer to each of these questions requires a carefully 
controlled scientific experiment. 

Space will not permit more than a listing of the ideal con- 
ditions aimed at by one of the simplest types of educational 
experiments, namely, the equivalent groups experiment. 

1. Two groups of pupils which are absolutely equivalent 
as shown by absolutely accurate and adequate initial mea- 
surements. The two groups must be more than equivalent 
as to averages. Every pupil in one group must be paired 
by an equivalent pupil in the other group. Increasing the 
size of the group will usually aid in securing equivalence. 
While in theory the initial measurements should be abso- 
lutely thorough, in practice they are seldom more than an 
accurate and adequate measurement of those abilities which 
the methods or materials or whatever is being contrasted 
are expected to alter. 

2. Except for the experimental factors the maintenance 
of absolutely identical conditions for the two groups for the 
entire period of the experiment. If the purpose of the ex- 
periment is to decide which of two methods of teaching silent 
reading is the more effective, the conditions surrounding 
each group are kept identical except that Group A is taught 
silent reading by Method A and Group B is taught by 
Method B. 

3. The maintenance of each experimental factor at ex- 
actly the desired intensity. 

4. An absolutely accurate and adequate measurement of 
the final ability of each group of pupils. 

5. A thoroughly just evaluation of the total worth of all 
changes occurring in Group A as compared with Group B. 

6. A final conclusion which is formulated in the light of 
the type of pupils used as subjects, and the intensity of each 
experimental factor. 

7. A statement of the statistical reliability of the conclu- 
sion, if there is the least suspicion that the ideal requirements 
have not perfectly obtained. In actual practice this, of 
course, means that the statistical reliability of the conclu- 
sions should always be stated. 
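
The fifth and seventh requirements can be made concrete with a small computation. The sketch below is an illustration only, under stated assumptions: the paired scores are invented, the function name is hypothetical, and the standard error of the mean paired difference is used as one modern stand-in for a statement of statistical reliability, not as the procedure this book prescribes.

```python
import math
import statistics

# Hypothetical matched pairs, invented for illustration. Each entry holds
# (initial, final) scores for a Method A pupil and for the Method B pupil
# who was paired with them on the initial measurement.
pairs = [
    ((20, 31), (20, 27)),
    ((24, 33), (24, 30)),
    ((28, 36), (28, 35)),
    ((31, 42), (31, 36)),
    ((35, 44), (35, 41)),
]


def paired_comparison(pairs):
    """Return the mean gain difference (A minus B) and its standard error."""
    # Gain of the A pupil minus gain of the matched B pupil, pair by pair.
    diffs = [
        (a_final - a_initial) - (b_final - b_initial)
        for (a_initial, a_final), (b_initial, b_final) in pairs
    ]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean of the paired differences.
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, std_error


mean_diff, std_error = paired_comparison(pairs)
print(f"mean gain difference: {mean_diff:.1f}, standard error: {std_error:.2f}")
```

A mean difference that is large relative to its standard error gives more confidence that one method is really superior; a small ratio suggests the experiment should be repeated or enlarged before any conclusion is drawn.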

Efficiency Measurement of Schools and School Sys- 
tems. — In extensive surveys of schools and school systems 
the careful methods of evaluating efficiency described in the 
preceding pages must usually give way to a cruder and more 
rapid method. Such a rapid method is illustrated in 
Table 14. 

TABLE 14 

Comparison of the Efficiency of School X for Each Test for Each Grade with 
Normal Efficiency. (Data from Table 2.) 

                                                           All-Grade   VI-VIII
Grade               III     IV      V     VI    VII   VIII      Mean      Mean

Thorndike Reading Scale Alpha 2
  1919 Median       5.5   10.8   18.0   23.0   24.6   25.9
  Norm              8.0   15.0   20.0   24.0   28.0   30.0
  Difference       -2.5   -4.2   -2.0   -1.0   -3.4   -4.1
  Grade Unit       -0.6   -1.0   -0.5   -0.2   -0.8   -0.9      -0.7      -0.6

Trabue-Kelley Completion
  1919 Median      16.5   29.0   38.5   42.5   48.3   55.5
  Norm             18.5   25.0   30.5   35.0   40.0   43.5
  Difference       -2.0    4.0    8.0    7.5    8.3   12.0
  Grade Unit       -0.4    0.8    1.6    1.5    1.7    2.4       1.3       1.9

Thorndike Vocabulary Scale A2X
  1919 Median      21.5   51.5   81.5  106.5  108.5  109.5
  Norm             30.0   65.0   83.0   95.0  108.0  117.0
  Difference       -8.5  -13.5   -1.5   11.5    0.5   -7.5
  Grade Unit       -0.5   -0.8   -0.1    0.7    0.0   -0.4      -0.2       0.1

Ayres Spelling Test
  1919 Median      13.5   21.5   30.5   47.5   48.3   55.5
  Norm             19.6   30.4   37.8   47.7   50.3   54.4
  Difference       -6.1   -8.9   -7.3   -0.2   -2.0    1.1
  Grade Unit       -0.9   -1.3   -1.0    0.0   -0.3    0.2      -0.6       0.0

Woody Addition Scale, Series B
  1919 Median       7.1   12.6   12.0   14.0   15.5   15.8
  Norm              9.0   11.0   14.0   16.0   18.0   18.5
  Difference       -1.9    1.6   -2.0   -2.0   -2.5   -2.7
  Grade Unit       -1.0    0.8   -1.1   -1.1   -1.3   -1.4      -0.9      -1.3

Woody Subtraction Scale, Series B
  1919 Median       0.7    7.3    8.8   10.0   12.2   12.8
  Norm              6.0    8.0   10.0   12.0   13.0   14.5
  Difference       -5.3   -0.7   -1.2   -2.0   -0.8   -1.7
  Grade Unit       -3.1   -0.4   -0.7   -1.2   -0.5   -1.0      -1.2      -0.9

Woody Multiplication Scale, Series B
  1919 Median       3.9    7.5    8.7   10.7   14.1   12.5
  Norm              3.5    7.0   11.0   15.0   17.0   18.0
  Difference        0.4    0.5   -2.3   -4.3   -2.9   -5.5
  Grade Unit        0.1    0.2   -0.8   -1.5   -1.0   -1.9      -0.8      -1.5

Woody Division Scale, Series B
  1919 Median       3.4    5.4    4.9    8.0   10.1    9.8
  Norm              3.0    5.0    7.0   10.0   13.0   14.0
  Difference        0.4    0.4   -2.1   -2.0   -2.9   -4.2
  Grade Unit        0.2    0.2   -1.0   -0.9   -1.3   -1.9      -0.8      -1.4

Nassau County Composition Scale
  1919 Median       2.5    3.0    4.1    4.7    5.0    6.2
  Norm              2.1    2.6    3.0    3.6    4.1    4.8
  Difference        0.4    0.4    1.1    1.1    0.9    1.4
  Grade Unit        0.7    0.7    2.0    2.0    1.7    2.6       1.6       2.1

Mean Grade Unit    -0.6   -0.1   -0.2   -0.1   -0.2   -0.3      -0.2      -0.2

Table 14 is read thus: On the Thorndike Reading Scale 
Alpha 2, the 1919 median for Grade III is 5.5. The norm 
is 8.0. The difference between the 1919 median and norm 
is a negative 2.5. The grade unit is negative 0.6. Stated 
differently, Grade III is below norm 2.5 which is equivalent 
to being below norm 0.6 of a grade. Grade IV is below 
norm one entire grade. Grades V, VI, VII, and VIII are 
below norm respectively, 0.5, 0.2, 0.8, 0.9 of a grade. All 
grades average 0.7 of a grade below norm. The last three 
grades average 0.6 of a grade below norm. The other tests 
are read in the same way. The grand total mean for each 
of the last two columns for all tests is in each case 0.2 of a 
grade below norm. 

The only new procedure in Table 14 is the computation of 
grade units. It is clear that differences between 1919 
medians and norms are not comparable from test to test. 
The difference of 1.4 for Grade VIII on the composition 
scale is, as shown, equivalent to 2.6 grades while a difference 
of 1.5 for Grade V on the vocabulary scale is equivalent to 
only 0.1 of a grade. A difference of 1.4 on the composition 
scale means much because the total progress of the norm 
from Grade III to Grade VIII is only (4.8-2.1) or 2.7. A 
difference of 1.5 on the vocabulary scale means little because 
the total progress of the norm from Grade III to Grade VIII 
is (117-30) or 87. Before the differences on the various 
tests could be combined they had to be made comparable. 
Comparability was secured by reducing all differences to 
grade-unit differences. 

The computation of the above grade units was as follows: 
Each difference was divided by the mean amount of norm 
progress for each grade for the test in question. On the 
Thorndike Reading Scale Alpha 2 the norm for Grade III is 
8.0, for Grade VIII is 30.0. The progress is (30.0-8.0) or 
22. The mean progress per grade is (22 ÷ 5) or 4.4. On 
an average, then, the typical pupil progresses 4.4 points for 
each grade. This means that if any grade of School X is 4.4 
below the norm, it is really (4.4 ÷ 4.4) or 1.0 grade or 1.0 
grade unit below norm. Thus in Grade III, the difference is 
negative 2.5, which, divided by 4.4, gives negative 0.6 of a 
grade unit. The difference for Grade IV is negative 4.2 
which, divided by 4.4, gives approximately negative 1.0 grade 
unit, and so on. The mean norm grade progress for the com- 
pletion test is 5.0, hence all differences for this test are di- 
vided by 5. 
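
The computation just described can be sketched in a few lines of Python. The function name `grade_units` is an assumption made for illustration; the medians and norms are those given for the Thorndike Reading Scale Alpha 2 in Table 14.

```python
def grade_units(medians, norms):
    """Convert median-minus-norm differences into grade units.

    Each difference is divided by the mean norm progress per grade,
    that is, (last norm - first norm) / (number of grades - 1).
    """
    mean_progress = (norms[-1] - norms[0]) / (len(norms) - 1)
    return [round((m - n) / mean_progress, 1) for m, n in zip(medians, norms)]


# Thorndike Reading Scale Alpha 2, Grades III to VIII (figures from Table 14).
reading_medians = [5.5, 10.8, 18.0, 23.0, 24.6, 25.9]
reading_norms = [8.0, 15.0, 20.0, 24.0, 28.0, 30.0]

reading_units = grade_units(reading_medians, reading_norms)
print(reading_units)
```

With these figures the divisor is 4.4 and the result agrees with the Grade Unit row of the table: -0.6, -1.0, -0.5, -0.2, -0.8, -0.9. Supplying the norms of another test changes only the divisor.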




The grade units might have been computed in many other 
ways. The mean grade progress might have been the grade 
progress of School X instead of the norm. An objection to 
this is that the meaning of the grade unit would vary with 
the school being studied. By using norm progress, one 
school which is one grade unit below norm is equivalent to 
another school which is one grade unit below norm. 

Again, instead of dividing each difference by the same 
mean, each difference might have been divided by the inter- 
val between the two adjoining grades which are nearest to the 
difference, i. e., the third-grade difference might have been 
divided by the norm progress of Grade IV over Grade III. 
But occasionally the norm score for Grade IV is less than or 
exactly equal to that for Grade III. When the norm scores 
are equal, the progress is zero. Any difference divided by 
zero gives an infinite quotient. 

A still better method is to divide each difference by the 
mean of the two or three nearest grade intervals. This 
would partially avoid a difficulty in the method used by 
me. My method somewhat exaggerates the differences for 
the lowest and highest grades, due to the fact that progress 
is usually most rapid in the lowest grades and least rapid in 
the highest grades. The procedure of dividing by the mean 
of the few adjoining intervals was not adopted partly because 
this makes the computation rather laborious, and partly for 
other reasons. 
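
For comparison, the alternative of dividing by nearby grade intervals might be sketched as below. This is one possible reading, not a procedure given in the text: it is assumed here that "nearest" means the norm interval just below and the one just above each grade, so that the lowest and highest grades use a single interval. The function name is hypothetical, and a zero divisor is reported as `None`.

```python
def local_grade_units(medians, norms):
    """Grade units using the norm intervals adjoining each grade as divisor."""
    intervals = [upper - lower for lower, upper in zip(norms, norms[1:])]
    units = []
    for i, (median, norm) in enumerate(zip(medians, norms)):
        # Intervals touching grade i: the one below it and the one above it.
        # The lowest and highest grades touch only one interval.
        nearby = intervals[max(i - 1, 0):i + 1]
        local_progress = sum(nearby) / len(nearby)
        if local_progress == 0:
            units.append(None)  # no norm progress nearby; quotient undefined
        else:
            units.append((median - norm) / local_progress)
    return units


# Thorndike Reading Scale Alpha 2, Grades III to VIII (figures from Table 14).
reading_medians = [5.5, 10.8, 18.0, 23.0, 24.6, 25.9]
reading_norms = [8.0, 15.0, 20.0, 24.0, 28.0, 30.0]

reading_local = local_grade_units(reading_medians, reading_norms)
print([f"{u:.2f}" for u in reading_local])
```

On the reading scale this reading of the method gives about -0.36 for Grade III and about -2.05 for Grade VIII, against -0.6 and -0.9 when every difference is divided by the single mean of 4.4. The end grades change the most, which shows how much the choice of divisor matters there.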

The educational surveyor does not always find his data 
in such form that he can proceed forthwith to compute grade 
units. I met just such a situation in measuring the reading 
ability of the white and colored elementary schools of Baltimore. 
The problem and its solution are presented in Table 
15. The Baltimore schools were measured at the close of 
November, whereas the norm and the achievement of partic- 
ular school systems were known for the last of June. 

To make all scores comparable Baltimore's scores for both 
white and colored schools were computed forward from November 
to June, i. e., about seven-tenths of the school year 
later. The process for the white schools was as follows: 
(44.9 - 39.6) × .7 = 3.7. 39.6 + 3.7 = 43.3. 
(49.0 - 44.9) × .7 = 2.9. 44.9 + 2.9 = 47.8, and so on for 
Grades VI and VII. What the eighth grade will be in June 
presents a special problem which can be solved by making use 
of the norm or the achievement of the average school system, 
thus: (60.9 - 58.3) ÷ (58.3 - 48.0) = .25. 
(58.1 - 47.8) × .25 = 2.6. 59.5 + (2.6 × .7) = 61.2. By 
locating the scores used and the results secured in Table 15 
the reader will have little difficulty in following the 
computation and the reason for each step. 
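
The steps above can be sketched as a short function. The function and variable names are assumptions made for illustration; the scores are those of Table 15. Intermediate values are not rounded here, as they are in the hand computation.

```python
def project_to_june(november, average_june, fraction=0.7):
    """Project November scores for Grades IV to VIII forward to June.

    Positions in both lists: 0=IV, 1=V, 2=VI, 3=VII, 4=VIII.
    """
    # Grades IV to VII: add `fraction` of the distance to the next grade.
    june = [this + (nxt - this) * fraction
            for this, nxt in zip(november, november[1:])]
    # Grade VIII has no higher grade to lean on. Following the text, take the
    # average system's VII-to-VIII growth as a ratio of its V-to-VII growth,
    ratio = (average_june[4] - average_june[3]) / (average_june[3] - average_june[1])
    # apply that ratio to the school's own projected V-to-VII growth,
    yearly_growth = (june[3] - june[1]) * ratio
    # and add `fraction` of the resulting yearly growth to the November score.
    june.append(november[4] + yearly_growth * fraction)
    return june


average_june = [41.8, 48.0, 53.7, 58.3, 60.9]
white_november = [39.6, 44.9, 49.0, 54.9, 59.5]
colored_november = [35.4, 39.7, 42.5, 46.9, 47.0]

june_white = project_to_june(white_november, average_june)
june_colored = project_to_june(colored_november, average_june)
print([round(x, 1) for x in june_white])
print([round(x, 1) for x in june_colored])
```

Run on the Table 15 figures, the sketch reproduces the June row for the colored schools and the first four June scores for the white schools. For the white eighth grade the arithmetic gives about 61.3, a tenth above the 61.2 printed in the table.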

TABLE 15 

Comparison of the Grade Scores on the Thorndike-McCall Reading 
Scale for the White and Colored Elementary Schools of Balti- 
more with Grade Scores for the Average School System and 
the Mean Achievements of Particular School Systems 



Grade                             IV       V      VI     VII    VIII

Baltimore white       (Nov.)    39.6    44.9    49.0    54.9    59.5
Baltimore white       (June)    43.3    47.8    53.1    58.1    61.2
Baltimore colored     (Nov.)    35.4    39.7    42.5    46.9    47.0
Baltimore colored     (June)    38.4    41.7    45.6    47.0    47.9
Average School System (June)    41.8    48.0    53.7    58.3    60.9
Paterson, N. J.       (June)    35.5    40.9    49.0    51.7    53.5
33 Wisconsin Cities   (June)    40.9    47.2    52.6    55.3    58.0
Louisville, Ky.       (June)    39.1    43.6    51.7    59.8    60.7
St. Paul, Minn.       (June)    41.8    46.3    53.5    58.0    62.5
18 Indiana Cities     (June)    49.9    58.9    67.0    68.8    71.5



Table 16 illustrates two ways, grade unit and relative 
position, which were used for reporting the efficiency of Bal- 
timore's schools. In making out Table 16 only comparable 
June scores were used. 
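
The two measures reported in Table 16 can be sketched as follows. It is an assumption here, made by analogy with Table 14, that the grade-unit divisor is the mean progress per grade of the average school system from Grade IV to Grade VIII. The function names are hypothetical, and the June scores are taken from Table 15.

```python
# June scores for Grades IV to VIII, from Table 15.
average_system = [41.8, 48.0, 53.7, 58.3, 60.9]
baltimore_white = [43.3, 47.8, 53.1, 58.1, 61.2]
baltimore_colored = [38.4, 41.7, 45.6, 47.0, 47.9]
other_systems = [
    [35.5, 40.9, 49.0, 51.7, 53.5],  # Paterson, N. J.
    [40.9, 47.2, 52.6, 55.3, 58.0],  # 33 Wisconsin Cities
    [39.1, 43.6, 51.7, 59.8, 60.7],  # Louisville, Ky.
    [41.8, 46.3, 53.5, 58.0, 62.5],  # St. Paul, Minn.
    [49.9, 58.9, 67.0, 68.8, 71.5],  # 18 Indiana Cities
]


def grades_ahead(scores, average):
    """Score minus average, in grade units of the average system's progress."""
    # Assumed divisor: mean progress per grade of the average system.
    mean_progress = (average[-1] - average[0]) / (len(average) - 1)
    # Adding 0.0 turns a rounded -0.0 into 0.0 for cleaner printing.
    return [round((s - a) / mean_progress, 1) + 0.0
            for s, a in zip(scores, average)]


def ranks(target, others):
    """Rank of `target` in each grade (1 = highest) among itself and `others`."""
    return [1 + sum(other[g] > target[g] for other in others)
            for g in range(len(target))]


white_units = grades_ahead(baltimore_white, average_system)
white_ranks = ranks(baltimore_white, other_systems)
colored_units = grades_ahead(baltimore_colored, average_system)
colored_ranks = ranks(baltimore_colored, other_systems)
print(white_units, white_ranks)
print(colored_units, colored_ranks)
```

Under that assumption the divisor is about 4.8 points per grade, and both the grade units and the ranks agree with the entries of Table 16 for the individual grades.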

Interpretations and Functions of Efficiency Measure- 
ments.— What may we infer about the efficiency of School X 
from the data of Table 14? We are in a position to evaluate 
the achievement of (a) each grade on each test, (b) each 
grade on all tests, (c) all grades on any test, and (d) all 
grades on all tests. 



TABLE 16 

A Table to Interpret the Data in Table 15 When Proper Allowance is Made 
for the Time of Testing 

Grade                                  IV      V     VI    VII   VIII   Mean

White Schools
  Grades Baltimore is ahead of
    Average School System             0.3    0.0   -0.1    0.0    0.1    0.1
  Rank of Baltimore among six
    School Systems                      2      2      3      3      3    2.6

Colored Schools
  Grades ahead of Average
    School System                    -0.7   -1.3   -1.7   -2.4   -2.7  -1.76
  Rank among six School Systems         5      5      6      6      6    5.6

It would be too tedious to list the achievement of each 
grade on each test. The achievement of each grade on all 
tests is shown at the bottom of the table. Every grade, 
without exception, is from 0.1 of a grade to 0.6 of a grade 
below norm. Grade III and Grade VIII are farthest below 
norm. 

Either the third-grade teacher is less efficient than the 
other members of the teaching staff or else there is some 
special explanation. Perhaps the third grade had not begun 
the study of certain abilities measured by the tests. It is 
almost a shock to discover that the children know almost 
nothing about subtraction and yet they are ahead of norm 
in division. For some reason every fundamental except sub- 
traction has been taught. The teacher was sadly in need 
of a simple standard test to regulate her educational empha- 
sis. Low mentality of pupils may explain the backwardness 
of the class, but it does not explain the misplacement of 
emphasis. To properly interpret the results for Grade III 
requires a more intimate knowledge of conditions than I 
possess. The third-grade teacher may be the most efficient 
teacher in the school. 

The achievement of all grades on each test, shown in the 
last two vertical columns of the table, brings to light some 
interesting facts. The school is from one to two grades 
ahead of standard on the completion scale. This might be 
attributed to unduly low norms were the results not in har- 
mony with the composition test, which, it is claimed, mea- 
sures approximately the same ability. On the reading scale, 
vocabulary scale, and spelling scale, the school is either up 
to norm or slightly below. In every arithmetic fundamental, 
without exception, School X is below norm about one grade. 
If the norm is accepted as a legitimate goal, the school needs 
to give special attention during the coming year to reading 
and to the fundamentals of arithmetic. 

The achievement of all grades on all tests, the final 
measure of the school's efficiency, is shown as -0.2 and 
-0.2 at the bottom of the table. The data of the entire 
table, when thus summarized, show that the school is 
two-tenths of a grade below the typical school. This conclusion 
probably holds not only for the tests used but for all the 
work of the school, if it were possible to measure and aver- 
age it. 

Education must justify itself in the eyes of the public 
which it serves. The hour has struck for a testing of every 
institution in existence. Not even as sacred an institution 
as the public school can escape the searching investigation 
of its critical patrons. Those whose business it is to control 
the distribution of the community's funds are asking more 
and more that each agency which serves the community 
show cause why it deserves the appropriation requested. 

Proof of the school's worth should be as scientific as pos- 
sible. The time has passed when a gullible public will 
accept as proof of the school's social value the fact that 
graduates of the elementary school are socially more valuable 
than illiterates and that graduates of the high school are 
worth more to the community than those who left school at 



164 How to Measure in Education 

the end of the eighth grade. The only proof that will finally 
be accepted is verifiable changes in pupils of demonstrable 
social worth. 

The best method yet discovered for measuring changes in 
pupils in such a way that conclusions may be verified is to 
employ standard educational tests. They are the friends of 
both teachers and public. They are a leaven working for 
an unpolitical scientific evaluation of those who operate the 
schools. They protect the efficient and expose the inef- 
ficient. 

Educational tests cannot be safely used to evaluate a 
school's efficiency without the exercise of ordinary discretion. 
A school can be compared with its own aims, provided these 
goals have been wisely located, but not with other schools 
without conditioning all conclusions upon certain possible 
factors. A school should never be judged except in the 
light of these conditioning factors. 

One conditioning factor is the permanence of the school 
population. A few years ago there were in a certain large 
city many more apartments for rent than there were families 
to rent them. As a result agents would pay moving ex- 
penses and frequently give a bonus of a month's rent to 
secure tenants. As a consequence of this and other causes, 
seventy-five per cent or more of the population of certain 
schools changed every year. Few pupils in the upper grades 
of many schools have been in any one school during their 
entire school career or even a large portion of it. In test- 
ing the pupils in such a school one is evaluating the efficiency 
of neighboring schools almost as much as the school being 
tested. The effect of such impermanence may be to make 
the efficiency of the school appear too low, too high or just 
right. School X has a relatively permanent school popu- 
lation. 

A second conditioning factor is the intellectual calibre of 
the pupils. To a certain extent their intellectual calibre 
may be judged by the general social status of the community. 
On the whole, the unprosperous, undesirable portions of a 
city produce children with an average mentality below that 
of the children coming from the wealthier sections. Fifth 
Avenue, New York City, may be more wicked than the lower 
East Side but it has more intelligence. Determination of 
intellectual calibre by means of intelligence tests is, however, 
more accurate than estimates from social status. Pupils in 
School X, according to estimates from social status, will 
average below normal intelligence. Whether this estimate 
is correct is to be checked later by tests. The formula for 
A.Q. provides a procedure for discounting this important 
conditioning factor. 

Bound up with this is a third conditioning factor, namely, 
the educative value of the home environment. In some com- 
munities the parents probably teach as much as the school, 
while in the others the home actually discourages study. 

A fourth conditioning factor is the amount of chronolog- 
ical retardation or acceleration. How to make allowance 
for this factor was described in the Directions for Using 
Thorndike-McCall Reading Scale. 

A fifth conditioning factor is the increase in ability which 
comes from mere maturing. This factor has little signifi- 
cance if all that is desired is a measure of comparative ef- 
ficiency. If it is desired to determine the absolute improve- 
ment produced by the school, any improvement due to mere 
maturing must be subtracted from the total improvement. 
The actual separation of the contributions of maturity and 
those of training has never been done. An elaborate experi- 
ment would be required. 

Courtis¹ attempted this separation in the survey of the 
Gary schools. He compared the curve of progress with age 
in certain mental functions which it is reasonable to assume 
are increased mainly by mere maturing with the growth 
curve for mental functions whose increase is the conscious 
objective of instruction. When the form of the two 
curves was similar he inferred that maturity and not specific 
training should have the credit for the increase in the 
educational trait. It might be argued with equal force that the 
mental function which the school was attempting to improve 
would not have increased at all without specific training, that 
Nature makes provision for increases in finger length but 
not for increases in spelling ability, and, finally, that 
schools whose pupils show growth curves in school traits 
shaped like maturity curves are schools which have 
continually and perfectly adjusted their curriculum to increases 
in the capacity of pupils, if not to the social needs of the 
community. 

¹ S. A. Courtis, The Gary Public Schools: Measurement of Classroom Products; 
General Education Board, New York, 1919. 

A sixth conditioning factor, especially where the efficiency 
of a teacher of only one or two subjects is being determined, 
is the transfer of the general training of the school or the 
specific training of other teachers to the abilities under con- 
sideration. Many teachers of special subjects would be 
chagrined to discover how much of the annual progress of 
their pupils is due to the general or specific training of their 
colleagues. 

A seventh conditioning factor is the distribution of em- 
phasis. A school whose pupils have a different social and 
vocational distribution may legitimately elect to emphasize 
certain subjects more and other subjects less than the typical 
school. Before selecting the tests for a school to be evalu- 
ated it is well to study the emphasis of the instruction as 
shown by the time schedule, and for the principal and 
teacher to state the subjects in which they expect the school 
to make the best showing, the next best and so on to the 
poorest. Tests can then be selected which will not be 
especially favorable or unfavorable to the school. 

When several tests are used and when they are fairly 
representative the likelihood of being led into an erroneous 
conclusion is greatly reduced. Had nothing but the four 
fundamentals of arithmetic tests been given to School X, for 
example, the judgment of its efficiency would have been a 
little too severe, for apparently the school has emphasized 
composition and related subjects to the detriment of the 
fundamentals of arithmetic and reading. The discovery 
that the school partially compensates an inferiority in cer- 
tain abilities with a superiority in others, makes us consider 
whether additional tests might not reveal that the school is 
up to norm. The fact that reading is below norm does not, 
however, hold out much hope, for reading is probably the 
key to more abilities than composition. This is enough to 
show the importance of remembering to interpret results in 
the light of the school's emphasis. 

There are many other conditioning factors which should 
not be overlooked. One is the length of the school term. It 
is not reasonable to expect certain southern rural schools to 
accomplish in a few months as much as city schools in ten 
months. The process of allowing for this is too simple to 
require description. Other factors which must be consid- 
ered are the training of the teachers, the equipment of the 
school, and the like. 

Dewey has suggested the desirability of a sort of exchange 
bureau for exchanging educational ideas. Standard tests are 
a particularly valuable means for communicating ideas and 
practices. It is one thing for a school to tell what it is doing, 
and another thing for it to show what it has achieved. Com- 
parisons of the achievement of the schools in one locality 
with that of schools in other localities will force a discussion 
of what ability and how much of it is of most worth in edu- 
cation, and thus speed up scientific location of goals. The 
manifold differences in the achievements, emphasis, and ob- 
jectives of various schools are not so much indices of a 
healthy condition as of a fundamental ignorance of the worth 
of varying amounts of each ability. 

Again efficiency comparisons from school to school will 
hasten the development of a national conception of educa- 
tion. The formulation of a national program of education 
should be made in the light of what the backward schools of 
the country are now doing. Williams and Foote have begun 
the task of discovering whether one such region is really 
educationally retarded, and if so how much. Periodic surveys, 
by means of educational tests, of favorable and 
unfavorable educational environments would help to bring 
national aid to the schools which most need help. To educa- 
tional measurement must fall the tremendous task of national 
diagnosis, and of setting up minimum educational standards 
for the country as a whole. 






CHAPTER VI 
MEASUREMENT IN VOCATIONAL GUIDANCE 

Functions of Vocational Guidance. — The investigations 
reported by Davis justify the conclusion of Parsons that we 
guide our boys and girls to some extent through school, and 
then drop them into this complex world to sink or swim, and, 
if they swim, to drift into some line of work by chance, 
proximity, or uninformed selection. 

Vocational guidance, then, has a genuine function, namely, 
to help each individual reach that particular vocational niche 
or, better, gateway which leads where he will most greatly 
benefit himself and most fully contribute to the good of all. 
This may mean, according to circumstance, an occupation 
with a fair wage, continual opportunity for self-improvement, 
and a prospect for advancement; or it may mean a job 
where the wage is small but which is in line with his ambition 
and in which he will receive the needed training; or it may 
mean only a temporary position which will provide the quick- 
est and largest financial return in order that he may be able 
to resume his school preparation for his chosen field. A 
realization of this function means that vocational guidance 
must deal not only with what Clark calls the "misfit human 
material that has formerly gone into society's scrap-heap," 
but with the misfit material that never reaches the scrap-heap 
stage. 

Another important function of vocational guidance is to 
bring into the pupil's education the drive of what President 
Eliot calls "the life career motive." Some pupils can joy- 
ously study, like the monk washed dishes, for "the glory of 
God," but such pupils are rare. To some no education is 

169 



1 70 How to Measure in Education 

liberal except one which ^'flutters in all directions and flies 
in none." There is no reason why an agricultural training 
should be essentially mean. Some of the most cultural 
phases of any education center about nature, the farm, and 
country life. The most beautiful portion of all poetry is 
the poetry of nature. The testimony is well nigh universal 
that the life career motive increases studiousness, decreases 
elimination, and reacts most favorably upon the teachers and 
supervisors as well. 

These functions can be achieved only through (1) a careful 
survey of the various occupations to determine the 
constancy of demand for employees, whether the occupation is a 
seasonal or ephemeral one, the ratio of demand to supply, 
the monetary rewards, the nature and amount of other types 
of rewards, the working conditions in the occupation, etc.; 
(2) a study of the results of such a survey by the pupil, both 
to aid him to choose his own occupation intelligently and as 
an important part of his general education; (3) a testing 
in various ways of the pupil's ability for and interest in each 
of the occupations; (4) the choice by the pupil with the 
advice of a vocational counselor, of his vocation; (5) the 
provision of adequate vocational education; (6) appropriate 
educational guidance in the light of the chosen vocation; (7) 
vocational placement at the end of the pupil's educational 
preparation; and (8) a systematic follow-up of each pupil 
sent into industry. Only that portion of this total process 
which is most intimately related to measurement will be dis- 
cussed further. 

Intelligence Limits in Vocational Guidance. — A boy of 
twelve or a youth of twenty stands before some school 
official enquiring what occupation it would be advisable for 
him to enter or for which to begin preparation. What must 
the educator know before he can give wise advice, and how 
can measurement help in this intensely human situation? 

Sound advice requires the educator or vocational counselor 
to know the general intelligence limits of the various 
occupations. This means that intelligence tests must be applied to 
members of representative occupations. Terman¹ has 
made some progress in the determination of occupational 
intelligence limits. The overlapping of I.Q.'s for the differ- 
ent occupations is so great that some college students have 
less intelligence than some hoboes! The median I.Q. and 
more especially the Q ^ more nearly bring out the true facts, 
namely that success as a business man or college student 
requires an I.Q. considerably in excess of that which is typi- 
cal for hoboes, salesgirls, firemen, policemen, motormen, and 
conductors.

A War Department bulletin on army mental tests² shows
the intellectual level for various occupations as determined
by the application of thousands of intelligence tests at the
army cantonments. The scores on these tests, for the
occupations shown, follow:

45 to 49 — Farmer, laborer, general miner and team- 
ster. 

50 to 54 — Stationary gas engine man, horse hostler, 
horseshoer, tailor, general boilermaker, and barber. 

55 to 59 — General carpenter, painter, heavy truck 
chauffeur, horse trainer, baker, cook, concrete or 
cement worker, mine drill runner, bricklayer, 
cobbler, and caterer. 

60 to 64 — General machinist, lathe hand, general 
blacksmith, brakeman, locomotive fireman, auto 
chauffeur, telegraph and telephone lineman, 
butcher, bridge carpenter, railroad conductor, rail- 
road shop mechanic, locomotive engineer. 

65 to 69 — Laundryman, plumber, auto repairman, 
general pipefitter, auto engine mechanic, auto 
assembler, general mechanic, tool and gauge 
maker, stock checker, detective and policeman, 
toolroom expert, ship carpenter, gunsmith, marine 
engineman, hand riveter, telephone operator. 

70 to 74 — Truckmaster, farrier and veterinarian. 

¹ Lewis M. Terman, The Intelligence of School Children, p. 286; Houghton
Mifflin Company, N. Y., 1919.

² Army Mental Tests, Methods, Typical Results and Practical Applications;
November 22, 1918, Washington, D. C.



75 to 79 — Receiving clerk, shipping clerk, stockkeeper.

80 to 84 — General electrician, telegrapher, band
musician, concrete construction foreman.

85 to 89 — Photographer.

90 to 94 — Railroad clerk.

95 to 99 — General clerk, filing clerk.

100 to 104 — Bookkeeper.

105 to 109 — Mechanical engineer.

110 to 114 — Mechanical draughtsman.

115 to 119 — Stenographer, typist, accountant, civil
engineer, Y. M. C. A. secretaries, medical officers.

125 and over — Army chaplains, engineer officers.

The first step is to utilize tests to define the intelligence 
limits of the various occupations. The second step in voca- 
tional guidance is to measure the individual to be guided to 
determine in which occupation level his intelligence falls. 
Then the vocational counselor is in a position to tell the 
pupil the work he is by intelligence fitted to do. The pupil 
can be informed that his intelligence approximately equals 
the average of that of individuals who are successfully en- 
gaged in, say, ten different occupations. The pupil may, 
if he chooses, decide for an occupation that is in the next 
intellectual level above, but he will not do so without being 
warned that the higher he aims above his natural level the 
smaller become his chances of success. Good luck, family 
pull, the possession of valuable accessory traits, etc., may 
cause him to "get along" out of his intelligence element, but
he should realize that the attempt would be a speculative 
one. 
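
The two steps just described, fixing the score limits of each
occupational level and then locating the individual within them,
can be sketched as a simple lookup. The bands and occupations
below are copied, in abbreviated form, from the army list given
earlier; the band "125 and over" is left out, and the list gives
no band for 120 to 124. The names ARMY_BANDS, occupations_at_level,
and occupations_above_level are illustrative assumptions and form
no part of the army procedure.

```python
# Score bands transcribed (abbreviated) from the army occupational list.
# Each entry: (lowest score, highest score, occupations typical of the band).
# The layout and the function names are illustrative assumptions.
ARMY_BANDS = [
    (45, 49, ["farmer", "laborer", "general miner", "teamster"]),
    (50, 54, ["horseshoer", "tailor", "general boilermaker", "barber"]),
    (55, 59, ["general carpenter", "painter", "baker", "cook", "bricklayer"]),
    (60, 64, ["general machinist", "general blacksmith", "butcher"]),
    (65, 69, ["plumber", "general mechanic", "gunsmith", "telephone operator"]),
    (70, 74, ["truckmaster", "farrier and veterinarian"]),
    (75, 79, ["receiving clerk", "shipping clerk", "stockkeeper"]),
    (80, 84, ["general electrician", "telegrapher", "band musician"]),
    (85, 89, ["photographer"]),
    (90, 94, ["railroad clerk"]),
    (95, 99, ["general clerk", "filing clerk"]),
    (100, 104, ["bookkeeper"]),
    (105, 109, ["mechanical engineer"]),
    (110, 114, ["mechanical draughtsman"]),
    (115, 119, ["stenographer", "typist", "accountant", "civil engineer"]),
]


def occupations_at_level(score):
    """Return the occupations whose band contains the score, or [] if none."""
    for low, high, occupations in ARMY_BANDS:
        if low <= score <= high:
            return occupations
    return []


def occupations_above_level(score):
    """Return the occupations in the next band above the pupil's own level."""
    for low, high, occupations in ARMY_BANDS:
        if low > score:
            return occupations
    return []


print(occupations_at_level(87))     # ['photographer']
print(occupations_above_level(87))  # ['railroad clerk']
```

Such a lookup gives only the level at which the pupil's score
falls; the warning about aiming above that level remains a matter
for the counselor's judgment.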

Such a determination of a pupil's intelligence is not only 
advantageous to the pupil, it may be very profitable for an 
employer, particularly if the employer has an opportunity 
to choose among applicants. Recently an almost physically 
perfect youth was given an intelligence test by a member 
of our psychology department. The test showed him to be 
feebleminded. Shortly afterward he was employed as a
messenger boy by Wanamaker. A package entrusted to him
disappeared. Detectives watched the boy and annoyed 
members of his family for several days. Later the package 
was found in the store where it had been carelessly dropped. 
At the end of the first week the boy was paid and dismissed. 
He lost his money before reaching home. Several other 
employers discovered their mistake by the same expensive
trial-and-error procedure. Neither the boy nor his family
nor his employer profited by these experiences. 

The determination of vocational intelligence limits and 
the placement of a pupil between these limits presents one 
great obstacle. We cannot now measure general intelligence. 
General intelligence is, according to Thorndike, composed 
of three intelligences, namely, abstract intelligence, social 
intelligence, and mechanical intelligence. 

Within any one of these intelligences an individual dis- 
plays great consistency, but while the correlation between 
any two intelligences is positive it is not as high as between 
abilities within any one intelligence. People who have ab- 
stract intelligence are able to deal successfully with abstract 
ideas. They are about equally successful with scientific 
principles, mathematical and chemical formulae, legal dis- 
tinctions and the like. They make capable lawyers, scien- 
tists, and theologians. Those endowed with social intelli- 
gence are most successful in handling human situations 
involving human relationships. If they have a social intelli- 
gence plus a strong instinct for mastery they tend to make 
successful business executives, army commanders, and other 
leaders of men. If the instinct for submission predominates 
they make satisfactory salesmen, politicians, saleswomen, or 
wives. Those with a mechanical intelligence prefer and are 
most competent to manage automobiles, motor boats, 
aeroplanes, washing machines, carpet sweepers and other 
mechanisms. 

To what extent this triple differentiation is caused by
original nature, and to what extent by experience is unknown.
That most men are stronger in one of these fields
than in the others will be readily admitted. That the types
continuously merge into each other and that most individuals 
who are very successful in one field could have learned to 
be a fair success in the others also, is not so generally under- 
stood. Nevertheless there is a sufficiently genuine distinc- 
tion to make the differentiation useful. 

These three intelligences cannot be measured with equal 
accuracy and ease. The traditional intelligence tests 
measure abstract intelligence primarily, though they do to a 
certain extent measure directly the other two also. Sten- 
quist and others have made some progress toward the con- 
struction of tests for measuring mechanical intelligence. 
The measurement of social intelligence in any satisfactory 
fashion is the most difficult and consequently least developed 
of the three. 

Moral and Physical Limits in Vocational Guidance. 
— An individual's vocational fitness depends upon his intel- 
lectual, moral, and physical abilities and purposes. Though 
individuals who are intellectually competent will be found 
to be on the average morally more dependable, the two are 
not identical by any means. There is a sufficient absence of
correlation to prompt Hollingworth³ to say:

"I would rather trust my life and limb to a motorman
whose feeble memory span is reenforced by a loyal devotion
to the comfort of his grandmother than to a mnemonic
prodigy whose chief actuating motive in life is to be a 'good
fellow.'"

The correlation between the physical and either mental or 
moral is, though positive, still less than between the mental 
and the moral. 

³ H. L. Hollingworth, Vocational Psychology, p. 217; D. Appleton & Co., 1916.

There are moral limits in occupations just as surely as
there are intellectual limits. For certain types of occupations
the moral limits are exceedingly low. The employee
in certain low-grade occupations need not be honest, industrious,
sympathetic, courteous or much of anything else except
physically strong. A boss is in constant attendance to
see that the employer's property is not stolen and that propensities
toward laziness do not injure his interests. We
can contrast with this the minimum moral limit required to
be a desirable Justice of the Supreme Court, commander of
an army in time of war, or any of the numerous positions
which carry with them tremendous character responsibilities.

Similarly there are physical limits in occupations. Recent 
experience has made everyone conscious of the physical 
limits in a military vocation. The occupations of prize fight- 
ing and wrestling each have a high physical minimum. An 
individual with delicate health and puny physique would 
scarcely be guided into the occupation of felling timber, 
heaving coal, digging holes, or breaking stones. Neither 
would a one-legged youth be guided into a job as messenger 
boy, or into professional baseball. 

The measurement of an individual's physical abilities, in 
so far as they relate to health, strength and the like is now 
accurate enough for most practical purposes of vocational 
guidance. The records of medical inspection and the meas- 
urements of physical directors should be carefully preserved 
for the use of the school's vocational counselor. The meas- 
urement of more subtle yet important physical abilities such 
as beauty, handsomeness and other similar physical qualities 
cannot yet be done objectively. They must be measured 
somewhat as moral qualities are measured, so that the two 
may be considered together below. 

Unlike the measurement of intelligence the measurement 
of an individual's moral abilities and of certain subtle physi- 
cal qualities must, for the present at least, be entirely sub- 
jective. The measurements must be made, either in terms 
of what associates think of an individual's honesty, courtesy, 
loyalty, industry, beauty, etc., or in terms of what the indi- 
vidual himself thinks concerning how these qualities of 
his impress other people. These methods are respectively 
that of consensus of associates and self-analysis.



The method of consensus of associates must be accepted 
as more accurate than the method of self-analysis, particularly
if the associates are competent judges, are numerous
enough, are sufficiently acquainted with the individual being 
judged, and have no conscious or unconscious motive for 
over-rating or under-rating the individual. The difficulty of 
securing judgments which even approximate these criteria 
makes it important to determine the accuracy of an indi- 
vidual's rating of himself. Presumably an individual knows 
himself better than any other individual, and, when seeking 
vocational guidance, he may have no decided motive for 
exaggerating his excellencies or minimizing his faults. 

Hollingworth reports in Vocational Psychology a particularly
thorough study of the accuracy of self-analysis.⁴
Cattell, Norsworthy, Wells, Fairchild, and others have also 
contributed to our knowledge of these measurements. Hol- 
lingworth selected from about one hundred and fifty women 
students in their third college year twenty-five students all 
of whom were acquainted with one another. He asked each 
student to rate all the twenty-five (the estimator included) 
in order for their possession of, say, humor, and then in 
order for conceit and so on for such other traits as neatness, 
intelligence, beauty, vulgarity, snobbishness, refinement, 
sociability, kindliness, energy, efficiency, and originality. 

⁴ For an outline for self-analysis see Outline of a Study of the Self, by Yerkes
and LaRue; Harvard University Press, 1914.

His results lead to the following conclusions: (1) The
general error of self-estimation tended to be half again as
great as the average error of the student's associates. (2)
Assuming the average rating of their associates to be correct,
there was an average constant error, in the case of
self-estimation, toward an under-estimation of their possession
of undesirable traits and an over-estimation of their
possession of desirable traits. They exaggerated most their
possession of refinement, and they minimized most their possession
of vulgarity. There was no average constant error
for self-estimation of beauty. This of course does not mean
that each student was egotistically inclined. Self-estimates
from scientific men, secured by Cattell, showed no such
constant error. (3) In the case of desirable traits, ability
to judge self and others accurately accompanies possession
of that quality, whereas in the case of undesirable traits the
reverse is true. (4) The student who most accurately rates
herself tends to most accurately rate others, though this, like
all the conclusions stated above, varies with the trait in question.
The reader is referred to Hollingworth's book not
only for the details but for additional conclusions.

For practical purposes the best evidence then is that an 
individual is one-half more inaccurate than the average in- 
accuracy of his or her associates. We do not know to what 
extent the error of self-analysis would have been reduced 
or increased had the students been men instead of women, 
or had the traits been more carefully defined, or had each 
student been instructed to rate herself in terms of what 
others think of her instead of what she thinks of herself, 
or had the situation been a genuine vocational-guidance 
situation. The method of self-analysis appears to be suf- 
ficiently accurate to have value in vocational guidance. 
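
The comparison between self-estimation and the consensus of
associates may be made concrete with a small sketch. The ranks
below are invented for illustration and are not Hollingworth's
data; the exact computation he used is not given here, so the
functions consensus, self_error, and associate_error are
assumptions about one reasonable way to carry it out. As in the
study, the average rating by associates is taken to be correct.

```python
from statistics import mean

# Hypothetical ranks (1 = most humor) given to four students A-D.
# ranks[judge][subject] is the position the judge assigned to the subject.
# All figures are invented for illustration; none come from Hollingworth.
ranks = {
    "A": {"A": 1, "B": 2, "C": 3, "D": 4},
    "B": {"A": 3, "B": 1, "C": 2, "D": 4},
    "C": {"A": 2, "B": 3, "C": 1, "D": 4},
    "D": {"A": 3, "B": 1, "C": 4, "D": 2},
}


def consensus(subject):
    """Mean rank given to the subject by associates (self-rating excluded)."""
    return mean(r[subject] for judge, r in ranks.items() if judge != subject)


def self_error(subject):
    """Signed error of the self-rating; negative means a better rank than the consensus."""
    return ranks[subject][subject] - consensus(subject)


def associate_error(subject):
    """Average absolute departure of each associate's rank from the consensus."""
    c = consensus(subject)
    return mean(abs(r[subject] - c) for judge, r in ranks.items() if judge != subject)


for student in ranks:
    print(student, round(self_error(student), 2), round(associate_error(student), 2))
```

In this invented example every self-error is negative, that is,
each student places herself higher in a desirable trait than her
associates do, which is the direction of constant error reported
above. The sketch scores each associate against a consensus that
includes her own rating, a simplification adopted for brevity.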

Aptitude Limits for Vocational Guidance. — The pres- 
ence in rare individuals of a phenomenal aptitude for some 
one occupation provoked not only the interest of primitive 
peoples but gave rise to several systems of magic and incan- 
tation by which these aptitudes might be prophesied, and 
the individual thus guided into a vocation where wealth 
and honor awaited him. Any one desirous to know the task 
designed for him by the Fates need only have his horoscope 
read in the light of his particular star or from the month 
of his birth. If he doubts the reliability of this horoscope 
he can have it verified by a palmist who will read the lines 
in his hand, or a phrenologist who will feel the bumps on 
his head, or better still a physiognomist who judges any- 
thing from the individual's aptitude to the spiritual state of 
his soul by the shiftiness of his eye, loftiness of his brow, 
squareness of jaw, thinness of lips, generosity of ear, protrudingness
of chin, distribution of dimples, shufflingness of
gait, clamminess of hand, and fishiness of eye. 

Horoscopes have disappeared except from the advertising 
columns of frontier newspapers. Palmistry has found its 
worth level, namely, as an entertaining exercise. Phrenology 
is still a source of income for its practitioners and is still 
a potent determiner of social attitudes. Physiognomy is 
almost universally accepted and is daily practiced. 

All four methods, either individually or collectively, have 
zero or almost zero correlation with what is actually inside 
the cranium. Physiognomy has a slight but practically 
negligible correlation with those minute neural connections 
which are the really significant objects of investigation. 
But so long as an individual is certain to be judged by 
phrenological and physiognomical symptoms, phrenology 
and physiognomy have genuine significance for vocational 
guidance, for as Hollingworth states: "Vocational success
depends not only on the traits one really possesses, but also
somewhat on the traits one is believed to possess."

While there are instances of marked special aptitudes for 
just one occupation, these instances are so rare as to be 
practically negligible. Each individual probably has an 
aptitude for some one occupation more than for any other 
but for most of us it would probably require the exactness 
of the Infinite to distinguish the occupation. The truth is 
that most persons, so far as capacity is concerned, could 
pursue any one of a dozen different occupations with prac- 
tically equal chances of success. The same qualities which 
make for success in the medical profession also make for 
success in the engineering profession or any other profes- 
sion on approximately the same intellectual level. The idea 
that most of us have a marked ability for just one of these 
professions assumes that each profession demands the exer- 
cise of only a small proportion of our mental make-up. As 
a matter of fact each demands the totality of an individual. 
Each of the higher occupations requires for success an all-round
general ability and about the same all-round general
ability.

As a rule there are no round pegs and square pegs. There 
are big pegs and little pegs. The individual who is always 
bemoaning the fact that he is a square peg trying to adapt 
himself to a round hole is probably a little peg trying to fill 
a big hole. These persons, together with occasional individuals
of acknowledged high general ability who have failed,
have given currency to the idea that square-pegness
and round-pegness is the rule. Truly able individuals fre- 
quently fail in one occupation and succeed in another, not 
because they always possess a special aptitude for the latter 
only but because the accidents of circumstances chanced to 
be against them in the former occupation and for them in 
the latter. Luck is of course far less influential than ability, 
but its influence is nevertheless considerable. If we think 
of an individual's different abilities as being the spokes of 
a wheel, all these spokes tend to be of the same length and 
the tire of the wheel tends to be a perfect circle. In a few 
instances there are extreme departures from this circle. In 
all instances there is probably some departure. The chief 
difference between individuals is not that one projects in 
one direction and another in another direction, but that one 
is a big circle and the other is a little circle. 

There is an objectively and practically measurable some- 
thing which constitutes the core of most aptitudes. It is 
overlaid with various incidental abilities and furthered or 
retarded by emotional or physical characteristics of the in- 
dividual. This something is general intelligence. If an 
individual's intelligence is all that is known some mistakes 
will be made in attempting vocational guidance, but if only 
one thing can be known, general intelligence is perhaps most 
important, for it is in this that individuals differ most and 
most significantly. Most men's legs are sturdy enough to 
carry all the weight of their brain. Most men have enough 
body to carry them successfully for most occupations.
Most men who possess good intelligence usually have sense
enough to realize that they must be fairly honest, decently 
industrious, etc. Failure is most frequently traceable to 
lack of brains. A pupil's intelligence score is an approxi- 
mate measure of the diameter of an approximate general 
ability circle and is hence an approximate basis for voca- 
tional guidance. 

But any individual who assumes that all the spokes in an 
ability wheel are of exactly equal length, or that instances 
of marked special aptitudes do not exist, or even that most 
individuals do not possess some tendency toward a special 
aptitude would make as egregious an error as one who 
assumed that individuals are all markedly lop-sided. Three 
principles or near-principles will make clear the limitations 
of guidance by intelligence tests. 

The first principle is that to guide a pupil into a highly
specialized occupation requires a specialized series of tests. 
Certain traits such as mathematical ability, ability in draw- 
ing, musical composition, singing, etc., may be so specific as 
to require a special diagnosis. It is fairly well established 
that a general intelligence measure will not reveal whether 
an individual possesses the peculiar combination of traits 
requisite for success in certain specialized occupations. The 
miniature, analogy, analytic, and empirical-sampling types
of tests, described in a later chapter, have been used to get 
at certain of these aptitudes. Thorndike's series of tests for 
clerical workers, and Seashore's tests of musical capacity, 
and Rogers' test for the diagnosis of mathematical ability, 
are all attempts to measure the degree of presence of certain 
specialized abilities. 

The second principle is that the lower we go in the occu- 
pational scale and the less the exercise of intelligence is 
required the less significant is an intelligence measurement 
as a basis for vocational guidance. Simple computation, 
checking and the like in clerical work are usually done about 
as well by persons of moderate intelligence as by persons of 
high intelligence, for the reason that the exercise of no more
than a rudimentary intelligence is required. And appropriate
specialized tests could easily discover individuals of low
intelligence who have enough aptitude to actually do better 
work than individuals of higher intelligence. If one gets 
down the occupational scale a little farther, a point is soon 
reached where the aptitude of the horse, dog, or cow sur- 
passes that of man. On the other hand, the higher up the 
occupational scale one goes, and the more the positions be- 
come responsible ones, and the more they require the exer- 
cise of a broad general intelligence, the more significant 
differences in intelligence become for purposes of vocational 
guidance. A vocational psychologist could serve the selfish 
purposes of a large industrial concern not only by showing 
when to choose employees for intelligence, but when there 
is little or no advantage in so choosing them. 

The third possible principle is that disabilities are more 
frequent than special aptitudes. It is the presence of 
special disabilities which often explains why an otherwise 
gifted individual fails to succeed in some highly specialized 
occupation. Carney⁵ describes a typical instance. A
graduate of Chicago University, who had an unusually keen 
mind and a pleasing personality, entered a large factory 
and was set to work computing percentages on a slide rule. 
To the surprise of all she failed to do satisfactory work. 
She was sent to Carney to be tested. She proved to be very 
high in intelligence and very low in arithmetic. She was 
assigned to a section which required the continuous exercise 
of general intelligence. In a short time, she had risen to 
the head of this section and was doing remarkably well. 
Thus the intelligence test gave the clue to the existence of a 
special disability. 

⁵ Chester S. Carney, "Some Experiments with Mental Tests as an Aid in the
Selection and Placement of Clerical Workers in a Large Factory"; University of
Indiana Bulletin, Vol. V, No. 1, pp. 60-74.

Time does not permit the measurement of intelligence,
and the measurement of all possible abilities and disabilities.
The measurement of the length of all the spokes in any
individual's ability wheel is a practical impossibility. It is
in fact unnecessary. The variation in the length of these
spokes from individual to individual is generally so slight, 
or is so insignificant in terms of vocational success that voca- 
tional guidance can ignore them unless the departure from 
the average is so extreme as to attract immediate notice. 
Certain traits of an individual need never be measured 
except in connection with certain occupations. The usual 
variations in finger length are customarily of no occupational 
significance, whereas for a typist or pianist their significance 
may be very great. Variations in beauty are said to be of 
no significance in the teaching profession, whereas they may 
have a profound effect upon the success of private secre- 
taries. A cumulative record of a pupil's scores on educa- 
tional tests given during the school career should be a very 
great aid and time-saver to the vocational counselor. A 
pupil whose spelling is abominable is not likely to succeed 
as a stenographer. A pupil who is slow and inaccurate with 
numbers will be handicapped as an accountant. A survey 
is urgently needed to locate the common causes of pro- 
nounced failures or pronounced successes in the various 
occupations in order that vocational counselors may be on 
the alert for the presence of these traits. 
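
The figure of the ability wheel suggests a simple screening
routine for a cumulative record: take the average of the pupil's
scores as the length of the typical spoke, and call attention only
to those spokes which depart from it markedly. The sketch below
is one way this might be done. The scores, the trait names, and
the cut-off of one and a half times the spread are assumptions
chosen for illustration; no such figures are given in the text.

```python
from statistics import mean, pstdev

# Hypothetical standardized scores for one pupil; the trait names and the
# cut-off of 1.5 spreads are illustrative assumptions, not values from the text.
spokes = {
    "general intelligence": 62,
    "reading": 60,
    "spelling": 35,
    "arithmetic": 58,
    "handwriting": 61,
    "drawing": 64,
}


def wheel_summary(scores, cutoff=1.5):
    """Return the average spoke length and the spokes that depart from it markedly."""
    centre = mean(scores.values())
    spread = pstdev(scores.values())
    if spread == 0:
        # Every spoke is the same length: a perfect circle, nothing to flag.
        return centre, {}
    marked = {
        trait: round((value - centre) / spread, 2)
        for trait, value in scores.items()
        if abs(value - centre) > cutoff * spread
    }
    return centre, marked


centre, marked = wheel_summary(spokes)
print(round(centre, 1), marked)
```

With these invented scores only spelling is singled out, and on
the short side, which is the kind of departure that would warn a
counselor against stenography for this pupil.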

Trade-Ability Limits in Vocational Guidance. — With 
but very rare exceptions industry expects to employ and 
then train its employees. These recruits frequently leave 
or are dismissed before their training period is completed. 
They seek similar employment elsewhere. The second em- 
ployer has a different problem from that of the vocational 
counselor. He must not only measure capacity to learn 
but must also measure the extent to which the applicant has 
acquired the specific occupational skills. The vocational 
counselor will be face to face with an identical problem just 
as soon as specific vocational training is undertaken on a 
large scale by the public school system. The problem is in 
fact already here. 

The army psychologists were confronted with the task of 
this second employer. They met it by analyzing numerous
jobs and by constructing carefully calibrated oral, picture,
and performance tests which either measured directly the 
amount of presence of the occupational skill or measured 
certain symptoms of occupational skill. 

The following sample questions on one of the blacksmith's 
tests indicate the nature of the oral test. Why is a flatter 
used? What is shown by white sparks flying from a piece 
of tool steel when it is in the fire? What do you use 
for tempering steel springs? What tool is used to make 
grooves? The picture test presented the candidate with 
tools, machines, products, etc., of his trade and required 
him to identify them and indicate their uses. In the expec- 
tation that some men could do who could not tell, some per- 
formance tests like truck-driving tests were used. Accord- 
ing to Bingham, the expected difference did not materialize 
except in such rare instances as to be practically negligible 
and to justify the conclusion "that superiority in trade pro- 
ficiency resides more often in the head than in the hands." 
The detailed procedure for the construction of such tests 
is described in a later chapter. 

These trade tests illustrate the technique. They are not 
nearly adequate for industrial purposes. They are not ade- 
quate, first, because they do not reveal how well or how 
quickly an individual will learn the trade nor whether he 
has special aptitude for the trade. Intelligence and aptitude 
tests are required for this purpose. They are not adequate, 
second, because they are not nearly numerous enough to 
cover all the many specialized industrial and business 
processes. Companies, like the Scott Company, are carry- 
ing forward, with the cooperation of large industrial com- 
panies, the work of adapting the army technique to the 
construction of needed occupational-ability tests. 

Personnel has an amusing description of what resulted in 
the early days of the late war before trades were defined 
or detailed trade tests had been constructed. The butchers 
who went overseas with the first troops were clerks as often 
as they were butchers. As luck would have it, butchers were 



184 How to Measure in Education 

not needed, due to shipments of frozen meat. So the 
butchers were converted into fairly successful meat book- 
keepers. Later troops were accompanied by genuine 
blood-letters rather than bookkeepers. A bitter complaint 
regarding their efficiency came from France. Investigation 
showed that the real need was for "paper butchers instead
of meat butchers." Again, a call came for Multiplex
puncher operators. Since no one knew that Multiplex
telegraph puncher operators were a kind of specialized
typists, the personnel staff took a chance and filled the requisition
with "drill press punchers, ticket punchers, and cow
punchers." This is a pretty good description of the vocational
placements now effected by chance, where, according
to Thorndike, the vocational counselor is a sign which reads: 
Boy Wanted. 

Interest Limits in Vocational Guidance. — The first 
step in vocational guidance is to determine the mental, moral, 
physical, special aptitude, and ability requirements of each 
occupation. The second step is to list for the pupil and 
his parents the occupations in which his ability promises 
success. From these a selection may be made on the basis 
of the pupil's purpose, strength of interest, or preference, 
because every increase in interest materially increases the 
chances for occupational success. 

What is to be done when interest and intelligence conflict? 
In this conflict are found some of the real tragedies of life 
— some striving to be that which they can never be, and 
some fretting because they have no chance to be that which 
they could be. In handling the former situation three prin- 
ciples must be kept in mind. First, intelligence is compara- 
tively unalterable while interest may be altered either as a 
result of maturity or artificial manipulation. Interest can 
be created, intelligence can't. Second, success tends to 
bring interest in an activity which previously was uninterest- 
ing. Any activity which brings monetary rewards or the 
approval of those whose approval is precious tends to become 
suffused with the pleasure which it purchased. Third, after
a certain small interest minimum has been exceeded additional
increments of interest are probably less effective in
terms of increased success than additional increments of 
intelligence. This third point is merely a probable hypothe- 
sis, for little is known concerning the relative contributions 
toward success of varying amounts of interest and intelli- 
gence either in school or out. 

The changeable nature of interest is a major problem in 
vocational guidance. The staid business man originally 
intended to be a locomotive engineer. The Greenwich 
Village poet meant to be an army sergeant. 

In so far as the content of elementary and high school 
subjects is representative of occupational content, a study 
by Thorndike⁶ of the interests of college students leads to
the following conclusions: (1) That there is considerable
permanence of pupils' interests; (2) that there is an equal 
permanence of pupils' abilities; (3) that the transition from 
high school to college marks a more drastic change in inter- 
ests and abilities than the transition from elementary school 
to high school; (4) that there is no point where interests 
and abilities become markedly stabilized; (5) that pupils' 
interests are highly indicative of their abilities. 

Bridges⁷ and Dollinger conducted an investigation to
check Thorndike's conclusion that pupils' interests are highly
indicative of their abilities. All their correlations were very
low, around .2 and .3. Consequently they concluded that 
interest is not highly indicative of ability and that somehow 
interest and ability must be measured separately. Thorn- 
dike would undoubtedly subscribe to the latter half of their 
conclusion. 
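
How little a correlation of .2 or .3 tells about the individual
may be seen from the square of the coefficient, which by a
standard statistical result gives the share of the variation in
one measure that goes with variation in the other. The short
computation below is offered only as an aid to interpretation;
the function name shared_variance is an assumption, and the
squared values are not figures reported by Bridges and Dollinger.

```python
def shared_variance(r):
    """Share of variation in one measure associated with the other: r squared."""
    return r * r


# Correlations of the size reported for interest and ability.
for r in (0.2, 0.3):
    print(f"r = {r:.1f}  ->  r squared = {shared_variance(r):.2f}")
```

On this reckoning less than a tenth of the variation in ability
goes with variation in interest, which accords with the conclusion
that the two must be measured separately.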

For the purposes of vocational guidance Crathorne's^ 
investigation is more pertinent. He studied the change of 

⁶ E. L. Thorndike, "Early Interests: Their Permanence and Relation to
Abilities"; School and Society, Feb. 10, 1917.

⁷ J. W. Bridges and Vernon M. Dollinger, "The Correlation Between Interests
and Abilities in College Courses"; Psychological Review, July, 1920.

⁸ A. R. Crathorne, "Change of Mind Between High School and College as to
Life Work"; School and Society, Jan. 3, 1920.



mind between high school and college as to life work in the 
case of 2,000 college freshmen. He found that on entering 
the high school 57 per cent of these students had decided 
upon a life work. Before entering college almost exactly 50 
per cent of these 57 per cent had changed their minds. And 
it was the privilege of the men to change their minds as 
frequently as the women! If it may be assumed that all 
students who go to high school are in the same mental state 
with reference to a vocation as those who enter college, it 
is fair to conclude that of high school freshmen only about 
25 per cent have made anything like a stable vocational 
choice. The mental state of those who do not go to high 
school is not known. It appears as though a student does 
not finally make up his mind until he is in a vocation, and 
even then he frequently changes his choice. Perhaps this 
uncertainty is desirable, perhaps it is not. Educators must 
consider whether the school should help to make possible 
an early and stable vocational choice by giving a wide 
variety of occupational experiences in the school. 
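
The arithmetic behind the estimate above can be checked in a few lines. This is only a sketch using the two percentages quoted from Crathorne; the function and variable names are illustrative and are not part of the original study. The product comes out nearer 28 than 25 per cent, so "about 25 per cent" is best read as a round figure.

```python
# Figures quoted in the text from Crathorne's study of college freshmen.
DECIDED_ON_ENTERING_HIGH_SCHOOL = 0.57  # share who had chosen a life work
CHANGED_BEFORE_COLLEGE = 0.50           # share of those deciders who changed


def stable_choice_share(decided: float, changed: float) -> float:
    """Share of all students who decided early and did not change later."""
    return decided * (1.0 - changed)


share = stable_choice_share(DECIDED_ON_ENTERING_HIGH_SCHOOL, CHANGED_BEFORE_COLLEGE)
print(f"{share:.1%}")  # prints 28.5%
```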

Properly to perform all of the foregoing tasks requires a 
vocational counselor who possesses not only rare insight but 
also technical training⁹ of a very high order. He must
know how to analyze occupations to discover the traits re- 
quired for success and in what proportion they are required. 
He must be able to construct tests which will measure these 
traits. He must be able to select tests which correlate with 
demonstrated fitness for a given occupation. He must know 
how to select the proper team of tests weighting each test 
in the team according to (a) the largeness of its self-correla- 
tion, (b) the smallness of its correlation with the other tests 
in the team, and (c) the largeness of its correlation with 
the criterion. He must know how to apply these tests and 
how to select the pupils who can learn an occupation and be 

⁹ For an admirable brief summary see Truman L. Kelley, "Principles Underlying
the Classification of Men," The Journal of Applied Psychology, March, 1919, and
E. L. Thorndike, "Fundamental Theorems in Judging Men," The Journal of
Applied Psychology, March, 1918.



efficient and happy in it. Ultimately he will have the addi- 
tional problem of distributing the available pupils into the 
available occupations so as to secure the minimum of mis-
placement.
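
How criteria (b) and (c) pull against each other can be illustrated for a team of only two tests. The sketch below uses the ordinary least-squares weights for standardized scores, which is a technique supplied here for illustration and not necessarily the procedure the counselor would follow. It does not model criterion (a), the self-correlation, and every correlation in it is hypothetical.

```python
from typing import Tuple


def two_test_weights(r1c: float, r2c: float, r12: float) -> Tuple[float, float]:
    """Standardized least-squares weights for two tests predicting a criterion.

    r1c, r2c: correlation of each test with the criterion.
    r12: correlation between the two tests.
    """
    denom = 1.0 - r12 ** 2
    if denom <= 0.0:
        raise ValueError("tests are perfectly correlated; weights are undefined")
    w1 = (r1c - r2c * r12) / denom
    w2 = (r2c - r1c * r12) / denom
    return w1, w2


# Hypothetical correlations, invented for illustration only.
independent = two_test_weights(r1c=0.5, r2c=0.5, r12=0.0)  # (0.5, 0.5)
overlapping = two_test_weights(r1c=0.5, r2c=0.5, r12=0.6)  # (0.3125, 0.3125)
```

With the same correlation with the criterion, each test earns a smaller weight once the two tests overlap, which is the sense of preferring a small correlation among the tests in the team.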

Vocational Guidance for the Gifted Pupils. — A great 
social waste is the vocational exploitation of the unusually 
gifted. With certain exceptions every employer is com- 
peting with other employers to secure the services of the 
most competent. The employer does not stop to consider 
whether he can give the gifted individual, whom he is lucky 
to employ, abundant opportunity to make the greatest social 
contribution of which he is capable. The country suffers 
an enormous loss each year because many of its geniuses 
have been caught by this exploiting system, and placed in 
relatively non-productive positions. The individual em- 
ployer can afford this but society can't. Society's aim is to 
guide no individual into an occupation above his intelligence. 
Society is equally concerned that great gifts be not frittered 
away on small jobs. In sum, we want both minimum and 
maximum intelligence limits for each occupational level. In 
so far as it can be done without doing too much violence to 
individual liberty, the social group should guide each indi- 
vidual to the level fixed for him by nature. Only thus can 
the social group be most efficient, prosperous, and happy. 

In time society will recognize its essential organic nature, 
and then the persons of low and average ability will them- 
selves insist that the able be placed where they can make the 
greatest contribution for the good of all. The gifted, con- 
sidering their superior native endowment as part payment 
for their services, will contribute to the social group without 
extorting undue monetary rewards from the group which 
they serve. Vocational guidance through the schools is 
about the only way to accomplish this great and beneficent 
task. 

Society cannot safely trust its geniuses to find their own 
way through the industrial maze. Immature occupational 



preferences frequently lead where there is no turning back. 
Below are selections from a list prepared by Miss Coy¹⁰
and another by Mrs. MacKnight of the vocational ambitions 
of gifted children. 

1. "I would like to be an author as it has a certain 
fascination and I have a rather flaming imagination. 
Also, you get a certain amount of royalty on each copy 
of a book that the publisher accepts. This would keep 
any successful writer well provided for." (Age = 11.
I.Q. = 143.)

2. "When I will be a man, I will be in the business 
that my father is in. That business is the Stock Ex- 
change. The reason is that you don't sit down like most 
of men who work." (Age = 8. I.Q. = 143.) 

3. "When I grow up I would like to be a research 
scientist of electricity, because I like to work with elec- 
tricity." (Age = 9. I.Q. = 145.) 

4. "When I grow up I think I will be a man who 
runs a boat with mail up and down the lake and earn
my living that way." (Age = 8. I.Q. = 151.)

5. "When I grow up I would like to be an inventor. 
The reason why is because I told my father what I was 
going to invent and he said it was all right." (Age = 9. 
I.Q. = 126.) 

6. "I would like to be a surgeon because I would 
help people." (Age = 10. I.Q. = 131.)

7. A stenographer or music teacher. (Age = 10.
I.Q. = 141.)

8. Carpenter or mechanic. (Age = 11. I.Q. =
121.)

9. A soldier — "not a general or a hero, but just a
common soldier." (Age = 12. I.Q. = 133.)

10. League baseball pitcher, motorcycle racer, pole
vaulter, wrestler, and be an "honest man." (Age = 12.
I.Q. = 122.)

Miss Coy concludes from her study of the vocational 
preferences of gifted children that in the main few of the 

¹⁰ Guy M. Whipple, Classes for Gifted Children, p. 77; Public School Publishing
Co., 1919.



pupils want to do things for which they lack ability but 
that there is a tendency to report ambitions that seem dis- 
tinctly too low. She further concludes that efforts to 
properly educate superior pupils should include a systematic 
effort "to foster and develop ambitions commensurate with
the latent capacities revealed by objective testing." 

Vocational Guidance for the Intellectually Subnormal. 
— Some fear that the guidance of pupils of low ability 
into occupations with low intelligence minima is essentially 
undemocratic. Perhaps it is, according to some definitions 
of democracy. What of it, if such guidance is the best thing 
to do? So long as occupational levels are not closed to those 
in a lower level, such guidance is a genuine kindness to the 
individual and decidedly advantageous to the social group 
as a whole. 

It is a genuine kindness to the individual for at least three 
reasons. In the first place, he will succeed better in occu-
pations requiring little intelligence. He will succeed not
so much because his ability is greatest for these occupations 
as because here and only here will he be in competition 
with his own kind. Most of the abler individuals will 
have risen to occupational levels where more of the world's 
rewards are for distribution. 

In the second place he will succeed better because the 
public is willing to pay him for services rendered on this 
level. There are men making a reasonably good living as
street cleaners who would starve as school teachers. The 
public is willing to pay them for cleaning the streets but is 
utterly unwilling to pay them a cent for teaching school, or 
practicing medicine. 

Finally, the mentally subnormal are probably happier at 
work which is physical and routine. There are numerous 
persons to whom, like Javert, thought is singularly painful. 
Consider the per cent of individuals who would beg to dig 
ditches or pound rocks to avoid writing this book or even 
reading it. 



SUPPLEMENTARY READING FOR PART I 

Ballard, Philip B. — Mental Tests; Hodder and Stoughton,
Ltd., Warwick Square, London, 1920.

Buckner, Chester A. — Educational Diagnosis of Individual
Pupils; Teachers College, Columbia University,
New York, 1919.

Bloomfield, Meyer. — Youth, School, and Vocation;
Houghton Mifflin Company, New York, 1915.

Burgess, May Ayres. — Measurement of Silent Reading;
Russell Sage Foundation, New York, 1920.

Chapman, J. Crosby. — Trade Tests; Henry Holt & Company,
New York, 1921.

Chapman, J. C., and Rush, G. P. — Scientific Measurement
of Classroom Products; Silver, Burdett & Company,
Boston, 1917.

Courtis, S. A. — The Gary Public Schools: Measurement of
Classroom Products; General Educational Board, New
York, 1919.

Davis, Jesse B. — Vocational and Moral Guidance; Ginn &
Company, New York, 1914.

Dewey, Evelyn; Child, Emily; and Ruml, Beardsley. —
Methods and Results of Testing School Children; E.
P. Dutton & Company, New York, 1920.

Fretwell, Elbert K. — A Study in Educational Prognosis;
Bureau of Publication, Teachers College, Columbia
University, New York, 1919.

Hollingworth, H. L. and L. S. — Vocational Psychology;
D. Appleton & Company, New York, 1916.

Hollingworth, L. S., and Winford, C. A. — The Psychology
of Special Disability in Spelling; Teachers
College, Columbia University, New York, 1918.

Huey, Edmund B. — Psychology and Pedagogy of Reading;
Macmillan Company, New York, 1908.

Various Conferences on Educational Measurement; Indiana
University Bulletins, University of Indiana, Bloomington,
Indiana.



Judd, Charles H. — Measuring the Work of the Public
Schools; Cleveland Education Survey, Russell Sage
Foundation, New York, 1916.

Judd, Charles H., and Others. — Reading: Its Nature and
Development; University of Chicago, Chicago, 1918.

Kelley, T. L. — Educational Guidance: An Experimental
Study in the Analysis and Prediction of Ability of High
School Pupils; Teachers College, Columbia University,
New York, 1914.

Kruse, Paul. — The Overlapping of Attainments in Certain
Grades; Bureau of Publication, Teachers College,
Columbia University, New York, 1918.

Link, H. C. — Employment Psychology; Macmillan Company,
New York, 1919.

Monroe, Walter S. — Measuring the Results of Teaching;
Houghton Mifflin Company, New York, 1918.

Monroe, Walter S.; De Voss, J. C.; and Kelly, F. J. —
Educational Tests and Measurements; Houghton
Mifflin Company, New York, 1917.

Munsterberg, Hugo. — Psychology and Industrial Efficiency;
Houghton Mifflin Company, New York, 1913.

National Society for the Study of Education. — Various
Year Books; Public School Publishing Company,
Bloomington, Illinois.

Pintner, Rudolf, and Paterson, Donald. — A Scale of
Performance Tests; Warwick & York, Baltimore, 1917.

Rogers, Agnes L. — Experimental Tests of Mathematical
Ability and Their Prognostic Value; Bureau of Publication,
Teachers College, Columbia University, New
York, 1918.

Starch, Daniel. — Educational Measurements; The Macmillan
Company, New York, 1917.

Strayer, George D., and Others. — Report of a Survey of
the School System of St. Paul, Minnesota; Department
of Public Instruction, St. Paul, Minnesota, 1917.

Terman, Lewis M. — The Measurement of Intelligence;
Houghton Mifflin Company, New York, 1916.



Terman, Lewis M. — The Intelligence of School Children;
Houghton Mifflin Company, New York, 1919.

Theisen, W. — Report on the Use of Some Standard Tests
for 1916-17; Wisconsin State Department of Public
Instruction, Madison, 1918.

Whipple, Guy M. — Classes for Gifted Children; Public
School Publishing Company, Bloomington, Illinois,
1919.

Wilson, G. M., and Kremer, J. Hoke. — How to Measure;
The Macmillan Company, New York, 1921.



PART TWO 

HOW TO CONSTRUCT AND STANDARDIZE 

TESTS 

CHAPTER VII. PREPARATION AND VALIDATION OF 
TEST MATERIAL 

CHAPTER VIII. ORGANIZATION OF TEST MATERIAL 
AND PREPARATION OF INSTRUCTIONS 

CHAPTER IX. SCALING THE TEST 

CHAPTER X. SCALING THE TEST. T SCALE— AGE VARI- 
ABILITY UNIT 

CHAPTER XI. DETERMINATION OF RELIABILITY, 
OBJECTIVITY, AND NORMS 



CHAPTER VII 

PREPARATION AND VALIDATION OF TEST 

MATERIAL 

I. Logical and Experimental Validation of Tests 

What Is a Valid Test? — Tests have many charac-
teristics such as validity, reliability, objectivity, etc. Of all
these traits validity is most fundamental. What is meant
by validity? The National Association of Directors of Edu-
cational Research has defined validity as the correspondence
between the ability measured by the test and ability as other- 
wise objectively defined and measured. When a test really 
measures what it purports to measure and consistently 
measures this same something throughout the entire range 
of the test it is a valid test. 

Ask a cautious psychologist just what a given test
measures and he will answer somewhat as follows: "It measures
the ability to do so and so with the material which you see
on the test sheet, when the test is applied under certain con-
ditions." If you are dissatisfied with this conservative state-
ment you may enquire: "Will the pupil who deals with
these test difficulties with a given degree of excellence deal
with these apparently same difficulties when imbedded in a
real, practical life situation with an equal degree of excel-
lence?"

No one knows very much about just how close results for 
the different tests are to the results in actual practice. We 
give a class a paper test composed of twenty reasoning prob- 
lems in arithmetic. Johnny does eighteen of the twenty 
problems. Had he met these twenty problems at the store or 
the post office or the playground, would he have succeeded 

195 



196 How to Measure in Education 

with these same eighteen problems and failed on the same 
two? Nobody knows. If he did not do the eighteen but did 
do sixteen, would Mary and Lucy who did fourteen and 
twelve test problems respectively show proportional de- 
creases when faced with real problems or might they possibly 
surpass Johnny? In other words, if test results and life
results do not coincide, do they even correlate, i.e., does
the pupil who makes the highest test score make the highest 
life score and the one who makes the second highest test 
score make the second best life score and so on? Nobody 
knows. We know enough to say that there will be a rough 
correspondence and probably a close correspondence, for 
the chasm between test conditions and life conditions does 
not yawn as wide as some would have us believe. It is un- 
doubtedly wider for some tests than for others. 

The layman is usually concerned with this problem of the 
correspondence between test results and practical life re-
sults. The technical worker in measurement is equally con-
cerned to know whether a test is a valid measure of some
element of an analyzed ability. The following suggestions 
as to how to secure validity apply primarily to securing a 
close correspondence between test results and practical life 
results. 

Suggestions for Increasing Validity. — Test results 
are more comparable to life results the more nearly the test 
process approaches the character of the life process. The 
ability of pupils to spell, for example, may be determined 
by (a) searching through their letters, compositions and 
the like, (b) having them write dictated sentences in which 
the critical words are imbedded, (c) having them write iso- 
lated words pronounced by the examiner. The composition 
method more nearly duplicates the life process, the dictation
method next, and the column spelling last. Again, pupils' 
ability in grammar can be measured by making an analysis 
of their written or oral compositions or by giving them a 
specially devised grammar test. The former test would 
yield more natural results. It is of course one of the perversities
of fate that an increase in naturalness is attended
by an increase in inconvenience. 

Tests vary greatly in the exactness with which they repro-
duce the life process. Hollingworth¹ lists four fundamental
types of tests: miniature, sampling, analogy, and empirical. 

He writes that in the case of the miniature test the "entire
work, or some selected and important part of it, is repro- 
duced on a small scale by using toy apparatus or in some 
such way duplicating the actual situation which the worker 
faces when engaged at his task. Thus McComas, in test- 
ing telephone operators, constructed a miniature switchboard 
and put the operators through actual calls and responses, 
meanwhile measuring their speed and accuracy by means 
of chronometric attachments." 

The sampling test measures a candidate's ability to do 
an actual sample instead of a toy representation of a given 
occupation. A would-be stenographer is given an actual 
test of ability with dictation and with a typewriter. A cler- 
ical aspirant is set to finding addresses in a telephone direc- 
tory or copying a table of figures. Practically all educa- 
tional tests are dummy samplings of this variety. We test 
a pupil's reading ability by samples of reading material. We 
test his ability to solve problems in arithmetic by giving him 
sample problems in arithmetic to do. 

The analogy test employs material which is neither the 
same as nor similar to the material of the occupation, but it 
is supposed to exercise those mental traits requisite for suc- 
cess in the occupation. To quote Hollingworth again: 
"Thus girls employed in sorting steel ball-bearings, and also
typesetters, have been selected on the basis of their speed of 
reaction to a sound stimulus." During the World War, 
Stratton, Henmon, Thorndike and others attempted to devise 
tests which would be diagnostic of ability for flying. At 
that time no empirical tests existed, and dummy tests were 
impractical. So those who were working on the problem 
first made an analysis of the mental and physical character- 

^ H. L. and L. S. Hollingworth, Vocational Psychology; D. Appleton & Co. 



1 98 How to Measure in Education 

istics upon which success in flying would logically seem to 
depend, and then devised means for measuring a candidate's 
possession of these traits. Tests were devised to measure a 
candidate's sense of balance, perception of tilt, nerve-resist- 
ance to sudden sensory shock and the like. By checking 
each of these tests against subsequent success as aviators, it 
was found that some had no significance at all, while others 
were slightly symptomatic. A composite score from those 
tests which were found valuable, selected aviators with fair 
accuracy. In similar manner tests were constructed to select 
shell inspectors, gun assemblers, etc. Pursuing this same 
method of analysis, Rogers has constructed tests for deter- 
mining whether pupils possess mathematical capacity. 
Briggs has constructed similar tests for foreign language 
capacity. 

The empirical tests are those which were discovered from 
a more or less haphazard trial and error search. The test 
selector makes no conscious attempt to select or construct a 
test which is a miniature or sampling or analogy. He tries 
out a number of tests, eliminates those which are not symp- 
tomatic and retains those which are. Analysis of the mental 
traits and their combinations requisite for success in the 
various educational or industrial occupations has not yet 
progressed far enough to offer a sure basis for the selection 
and construction of tests. Hence it is the opinion of many 
psychologists that the empirical method of discovering occu- 
pational tests is more promising than any other for the imme- 
diate future. 

Test results are more comparable to life results when they 
are free from irrelevancies. To return to the illustration of
a reasoning test in arithmetic, the arithmetic problems prob- 
ably more nearly duplicate real problems when they are free 
from non-arithmetical difficulties. Complicated instruc- 
tions for the test might so confuse the pupils as to leave no 
fair opportunity to attack the arithmetical difficulties. 
Again, a complex wording of the problems might make the 
linguistic difficulty of greater consequence than the difficulty
of the mathematical processes themselves. In selecting or
constructing tests they should be carefully studied to dis- 
cover whether everything possible has been done toward 
the elimination of irrelevancies in instructions and in the
organization and wording of the test elements, or at least 
toward determining the influence of these irrelevancies. 

While linguistic irrelevancies are more common, they are 
not the only kind by any means. The form of the test is 
often an irrelevancy. Not only must the pupil overcome 
the difficulties of the real test material, which is always to 
some extent camouflaged by linguistic irrelevancies, but he 
must also overcome the difficulty of the general form in 
which the test is couched. These moulds for test material 
are many. There are the question mould, completion mould, 
classification mould, matching mould, and many others. All 
these irrelevancies are important elements of difficulty 
especially for young children. They do greatest harm in 
rate tests where the speed score of the pupil is much influ-
enced by the rapidity with which he adapts himself to the
test. 

Terman² says of the army intelligence test Alpha: "The
test questions were ingeniously arranged so that practically 
all could be answered without writing, by merely drawing 
a line, crossing out or checking." There were various 
reasons for this provision, such as to require less time for 
testing and to make scoring economical and objective. But 
a very important reason was to make a test which would 
test the thing for which the test was designed. It was de-
signed to measure general intelligence. If writing were
made a prominent feature of the test, the test would tend 
to give a measure of speed of handwriting rather than of 
intellectual ability. Individuals are more alike in their 
speed of checking, crossing out and underlining than they 
are in speed of penmanship. 

It is possible, especially in the case of very long tests, 
that the chief factor measured is not the ability desired but

² Psychological Bulletin, June, 1918.

fatiguability. The test should be of such a length or so constructed
as to eliminate fatigue, particularly if some of the
pupils fatigue more easily than others. This point needs 
most attention when comparisons are to be made between 
young and old children. 

Fatigue may be eliminated in various ways. First, the 
test can be made short. Second, if reliability requires a
longer test, the test can be divided into parts with a rest or 
exercise interval between. Third, if the test consists of a 
series of short tests, the shorter tests may be so arranged 
as to have difficult tests followed by easy tests and tests of 
one nature followed by tests of another nature and vice 
versa. Fourth, the test can be made variegated and inter- 
esting both as to type and material. The material in the 
Alpha intelligence test for the army, for example, kept the 
recruits in a merry and at times almost boisterous mood 
throughout. 

The foregoing propositions concerning irrelevancies should 
be accepted with caution and applied with care. The prop-
ositions were made to direct attention to certain prob-
lems rather than because they have a firm experimental
basis. If the examiner's purpose is to make a psychological 
study of pure arithmetical abilities there can be no ques- 
tion but that every possible linguistic or other irrelevancy 
should be eliminated from the tests used. Similarly when 
linguistic ability is being measured, all non-linguistic dif-
ficulties should be eliminated. But if life's arithmetic prob-
lems are to be duplicated we cannot be so sure of the value
of eliminating all irrelevant difficulties. When a child pays 
for purchases in a store he must steer his course through 
numerous distractions which are not all mathematical in 
their nature. Since these practical distractions cannot con- 
veniently be duplicated in a test, perhaps the linguistic or 
other difficulties should be retained as a sort of substitute. 
Again, the propositions should be applied with care, because 
an irrelevancy in one test may not be so at all in another 
test. If the form or mould of a test duplicates the pattern
of the pupil's mental processes in performing an actual task,
the form of the test is not an irrelevancy. A casual inspec- 
tion of the following task taken from the Woodworth-Wells 
Directions Test would give one the impression that the whole 
test is nothing but an irrelevancy, and yet this impression 
would be a mistake, for the purpose of the test is to measure 
the ability to deal with just such complicated directions.

"With your pencil make a dot over any one of these letters F G H I J,
and a comma after the longest of these three words: boy mother girl
Then, if Christmas comes in March, make a cross right here ______, but
if not, pass along to the next question, and tell where the sun
rises ______ If you believe that Edison discovered America, cross
out what you just wrote, but if it was someone else, put in a number to
complete this sentence: 'A horse has ______ feet.' Write yes, no
matter whether China is in Africa or not ______; and then give a
wrong answer to this question: 'How many days are there in the week?'
______ Write any letter except g just after this comma, ______ and
then write no if 2 times 5 are 10 ______ Now, if Tuesday comes after
Monday, make two crosses here ______; but if not, make a circle
here ______ or else a square here ______ Be sure to make three
crosses between these two names of boys: George ______ Henry.
Notice these two numbers: 3. 5. If iron is heavier than water, write
the larger number here ______, but if iron is lighter write the smaller
number here ______ Show by a cross when the nights are longer: in
summer ______? in winter ______? Give the correct answer to this
question: 'Does water run uphill?' ______ and repeat your answer
here ______ Do nothing here (5 + 7 = ______), unless you skipped
the preceding question; but write the first letter of your first name and
the last letter of your last name at the end of this line:"

Increasing Validity Through Comprehensive Meas- 
urement. — A test can be made comprehensive, and in a 
certain sense more valid, by including all the material within 
the subject or ability being measured. This is feasible when 
the examiner is interested in only a narrow ability or limited 
field of subject matter. It is possible to comprehensively 
measure even a comprehensive subject matter but to do so 
might require almost as much time as it took the pupil to 
learn it. Hence some more economical method must be 
found for measuring a comprehensive ability. 

A test can be made comprehensive by including random 
samplings of the ability in question. In order to determine
how many words a pupil can spell, or define, or use, it is 
not necessary to try him on every word in Webster's Dictionary.
It can be done just as well by taking from the
dictionary a random sampling of its words. In making such 
a sampling it is important that the samplings be made ran- 
dom, and that enough samples be employed to yield a reliable 
measure of the pupil. Randomness may be secured by using 
the first or ninth or any other numbered word on each page 
or each third page or each twenty-fifth page or the like of 
the dictionary. This will suggest how chance samplings 
may be made from a variety of subject matter. It is worth 
pointing out that when test material is selected according to 
this random-sampling method, the construction of duplicate 
tests becomes a very simple matter. The value of such 
duplicate tests will appear later. It should be remembered 
that the method of random sampling answers only the ques-
tion: What per cent of a total field of knowledge does a
pupil know? Except for the elements in the test, such a test 
leaves us in ignorance as to just what elements in the field of 
knowledge the pupil knows. 
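
The page-and-position rule described above may be sketched as follows. The toy "dictionary", the pupil's vocabulary, and all function names are hypothetical and serve only to show the mechanics. Strictly speaking the rule is a systematic sample and not a draw by lot, though for a dictionary the two serve much the same purpose.

```python
from typing import List, Sequence, Set


def sample_words(pages: Sequence[Sequence[str]], word_position: int,
                 page_step: int, first_page: int = 0) -> List[str]:
    """Take the word at `word_position` on every `page_step`-th page."""
    chosen = []
    for page in pages[first_page::page_step]:
        if word_position < len(page):  # skip pages that are too short
            chosen.append(page[word_position])
    return chosen


def percent_known(sample: Sequence[str], known: Set[str]) -> float:
    """Per cent of sampled words the pupil knows (estimates the whole field)."""
    if not sample:
        raise ValueError("sample is empty")
    return 100.0 * sum(word in known for word in sample) / len(sample)


# A toy dictionary of six pages with three words each (hypothetical data).
pages = [
    ["abbey", "able", "about"],
    ["baker", "ball", "band"],
    ["cabin", "cable", "cage"],
    ["daily", "dairy", "dance"],
    ["eager", "eagle", "early"],
    ["fable", "fabric", "face"],
]
# Starting on a different page gives a duplicate form with no shared words.
form_a = sample_words(pages, word_position=0, page_step=2)                # pages 0, 2, 4
form_b = sample_words(pages, word_position=0, page_step=2, first_page=1)  # pages 1, 3, 5
pupil_knows = {"abbey", "cabin", "baker", "daily", "fable"}
score_a = percent_known(form_a, pupil_knows)
score_b = percent_known(form_b, pupil_knows)
```

With so few words the two forms disagree widely, which illustrates why enough samples must be employed to yield a reliable measure.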

To overcome this last obstacle, especially in the field of 
skill tests, it has been suggested that comprehensiveness be 
secured by using type material. This type principle of
selection assumes that each subject involving skill contains 
typical units or typical processes, and that the pupil's ability 
in the entire subject is substantially determined by measur- 
ing his ability in the type processes. The fundamentals of 
arithmetic, for example, are supposed to contain certain type 
processes. The ability to carry in addition is one such type 
process. The ability to fix the decimal point in division is 
another type process and so on. It is held that a test to be
representative of the fundamentals of arithmetic must con- 
tain every type process. Ballou and Monroe have each 
analyzed out these processes in the fundamentals of arith- 
metic and have constructed tests to measure them. 

Monroe³ has criticized the Woody Arithmetic Scales be-
cause Woody did not select examples for his tests primarily

³ Walter S. Monroe, "An Experimental and Analytical Study of Woody's
Arithmetic Scales"; School and Society, Oct. 6, 1917.



on a type basis. Monroe contends that Woody sacrificed 
diagnostic ability to statistical beauty, since Woody retained 
examples in his scales primarily because of their statistical 
behavior — because of their difficulty. 

Another principle for selecting test material which has 
come into common use is the social-worth principle. This 
principle makes comprehensiveness subordinate to relative 
value. The social-worth principle assumes that the most
valuable information for the school will come from testing 
the pupil's ability to spell only those words, or solve only 
those problems, or demonstrate a knowledge of only those 
historical facts which are of greatest social value. The best 
illustration of a test whose construction has been guided by 
this principle is the Ayres or Ashbaugh Spelling Test. The 
Ayres test contains 1000 words which were selected by ex- 
haustive investigations to discover which words were most 
frequently used. Similar surveys for other subjects will 
make it possible to construct other tests in accordance with 
this principle. 

Comprehensiveness requires that we not only measure how 
much a pupil can do and how well he can do it, but also we 
must measure how rapidly he can do it. This proposition 
needs no justification, for the practical importance of such a 
diagnosis of the pupil's habit of work is obvious. At least 
one major aim of the school is to prepare the pupil for effec- 
tive participation in the social group. The social group does 
not want the pupil's ability, nor does the pupil derive much 
joy or profit from his ability, if he falls below a minimum 
of speed. Thus the three main dimensions of a pupil's 
ability are (a) how much or how difficult, (b) how well or
how accurately or with what quality, and (c) how rapidly. 
If reading is to be measured, a test or tests (for frequently 
all three dimensions cannot well be measured in a single test) 
should be constructed which will measure all three aspects of 
reading. 

Validation of a Trade Test. — How may we know 
whether a given test measures the ability which we desire
measured? We know what a test measures only by its cor- 
relations. Does a pupil's score on an intelligence test coin- 
cide with the school's and world's estimate of this pupil? 
Does the arithmetic test indicate how well a pupil will be able 
to work examples or solve arithmetical problems in the store 
or in those realms for which the school is preparing the 
pupil? 

Two ways of determining this correspondence are avail- 
able. One method is to give a test to a group of pupils, 
to preserve the records, to follow up the testing with pro- 
longed careful observation of how well these pupils do in real 
situations which may or may not be arranged by the inves-
tigator, to rank the pupils in order of their ability, first on
the test and second in the real situation, and finally, by
the method of correlation described later to determine the 
correspondence between these two rankings. If the agree- 
ment is close the test does measure real ability in the sense 
that it can rank a group of pupils in order for their posses- 
sion of the ability in question. An even more careful tech- 
nique is required to determine the extent to which a pupil 
will make the identical score in both the test and the life 
situation. 
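
The first method can be sketched in a few lines of code. What follows is only an illustration under stated assumptions, not the correlation procedure described later in this book: it assumes a rank-difference (Spearman-type) coefficient, assumes that no two pupils have equal scores, and uses invented function names.

```python
def rank_positions(scores):
    """Return each pupil's rank, where rank 1 is the highest score.

    Assumption for this sketch: no two scores are equal.
    """
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def rank_correlation(test_scores, life_scores):
    """Rank-difference correlation between the test ranking and the
    ranking observed in the real situation (same pupil order in both lists)."""
    n = len(test_scores)
    test_ranks = rank_positions(test_scores)
    life_ranks = rank_positions(life_scores)
    # Sum of squared differences between the two ranks of each pupil.
    d_squared = sum((a - b) ** 2 for a, b in zip(test_ranks, life_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A coefficient near 1 would indicate that the test ranks the pupils much as the real situation does. It says nothing about whether a pupil makes the identical score in both, which, as noted above, requires a more careful technique.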

The second method available is to apply the test which is 
being validated to a group of individuals whose real ability is 
already known. If the test distinguishes the different de- 
grees of known merit, we can call the test satisfactory. 
Ruml, Robinson, Chapman, Meine, Kruse, Wiley, Toops, 
and others constructed about 100 Trade Tests for the army
during the war. As the following quotation from the
Psychological Bulletin, June, 1918, will show, they employed
this second method to determine whether their tests really 
measured the trade skills which the tests purported to 
measure. A long quotation is given below in order to show 
details in the process of trade test construction. Few edu- 
cational tests are constructed with such careful attention to 
what the tests really measure. The test is usually assumed 
to measure what it looks as though it measures. 



"When the problem of formulating tests was analyzed, it 
was seen that certain requirements were fundamental. A 
good trade test: (1) Must differentiate between the various
grades of skill; (2) Must produce uniform results in various 
places and in the hands of individuals of widely different 
characteristics; (3) Must consume the least amount of time 
and energy consistent with satisfactory results. 

"While there are all degrees of trade ability among the
members of any trade, it is convenient to classify them in a 
few main groups. Ordinarily the terms Novice, Apprentice, 
Journeyman and Journeyman Expert (or Expert) are em- 
ployed. The Novice is a man who has no trade ability 
whatever, or at least none that could not be paralleled by 
practically any intelligent man. The Apprentice has ac- 
quired some of the elements of the trade but is not suf- 
ficiently skilled to be entrusted with any important task. 
The Journeyman is qualified to perform almost any work 
done by members of the trade. The Expert can perform 
quickly and with superior skill any work done by men in 
the trade. 

"It is sometimes desirable that the Trade Test should 
differentiate between the skill of different members of the 
same group; for instance, the journeyman group. It is 
essential that it should differentiate between the journeyman 
and the apprentice, and the apprentice and the novice. 
Trade tests devised to make this classification are of three 
kinds: oral, picture and performance. 

"The oral tests are most generally used because they are 
of low cost and they may be applied to a large number of 
men in a comparatively short time and without much 
equipment. They are satisfactory in determining the pres- 
ence or absence of trade ability and in many instances deter- 
mine the degree of ability with such accuracy that no other 
tests are required. 

"An Oral Test is developed by passage through twelve 
stages: (1) Priority, (2) assignment, (3) inquiry, (4) col-
lection, (5) compilation, (6) preliminary sampling, (7)
revision, (8) formulation, (9) final sampling, (10) evalua- 
tion, (11) calibration, (12) editing. 

"Collecting the Trade Information. — From time to time
the Personnel Organization of the Army submits to the 
Central Trade Test Office (Newark, N. J.) a list of trades
which are required in Army use and for which tests are 
urgently needed. Upon the basis of this list, assignments 
are made to the field staff. 

"The field staff then makes thorough inquiry into the 
conditions of the trade. Their purpose is threefold:

"1. To determine the feasibility of a test in this trade.
It was found, for example, that the trade of gunsmith was 
not a recognized trade, though there were gun repairers. 

"2. To determine the elements which require and permit 
of testing. In other words, can men be graded in it accord- 
ing to degrees of skill? In some trades it was found that the 
trade required simply the performance of a single set of 
operations and there were no gradations among the members 
of the trade. 

"3. To determine the kinds of tests that can be used. 
Some trades, such as truck driving and typewriting, are 
mainly matters of skill and for them performance tests are 
better than oral tests. Other trades, such as interior wiring 
and power plant operation, are mainly matters of knowledge. 
For these trades oral and picture tests are best. 

"After having discovered by inquiry that the trade is a 
recognized trade and can be tested, the field staff proceeds 
to collect all the information necessary from all available 
sources; for example, experts of the trade, trade union 
officials, literature of the trade, trade school authorities, 
employers and the like. They discover by this means what 
are the elements of the trade and what constitutes pro- 
ficiency in it. 

"Compiling the Questions. — As a result of this collection 
of information they compile a number of questions, usually 
forty to sixty, each of which calls for an answer that shows 
knowledge of the trade. Experience in the formulation of 
such questions has shown that a good question meets the 
following requirements:

"1. It must be in the language of the trade.

"2. It must be a unit, complete in itself and requiring no 
explanation. 



"3. It must not be a chance question which could be 
answered by a good guess. 

"4. It must be as short as possible and must be capable 
of being answered by a very short answer. 

"5. It must not be ambiguous; the meaning must be un- 
mistakable. 

"After the large number of questions originally formu- 
lated has been sifted down by application of these require- 
ments they are used in a preliminary sampling on a number 
of tradesmen whose answers indicate the merits of the 
different questions and their grades from easy to difficult. 
In this sampling, tradesmen from different shops or plants 
are tried, in order to guard against specialized methods or 
modes of expression confined to a single locality. At least 
two examiners work on each set of questions at this stage 
to get the benefit of more than one point of view for revision. 

"This preliminary sampling affords a means of checking 
on the following points:

"1. Is the test applicable to trade conditions?

"2. Does the test represent good trade practice? 

"3. In what way can parts be profitably modified, sup- 
plemented or eliminated? 

"4. Does the test represent the whole range of the trade 
from the novice to the expert? 

"5. Is it a representative sampling of the whole range of 
trade processes? 

"In the light of the answers to these questions, the test 
is revised and is then ready to be formulated. 

"Final Sampling. — Final sampling is made by testing
twenty men who are known to be typical representatives of 
each group (novice, apprentice, journeyman, expert). 
Among the novices tested are some highly intelligent and 
mature men of good general knowledge but no trade ability. 
Examinations are given to men whose record in the trade is 
already known and who are tested as nearly as possible in 
the same manner as men in the camps. 

"The results of this final sampling are now turned over 
to the Statistical Department of the Central Trade Test 
Office. The experts in this department make a careful study
of the results and of the answers to each question. This 
enables them to determine the relative value of each indi- 
vidual question and the selection that makes a proper 
balance. 

"Evaluating the Test. — If a trade test is good, a known 
expert, when tested, is able to answer all, or nearly all, the 
questions correctly; a journeyman is able to answer the 
majority; an apprentice a smaller part, and a novice prac- 
tically none. This does not mean that each question should 
be answered correctly by all the experts, a majority of the 
journeymen, some apprentices but no novices. There are 
few questions which show this result.

"Other types of questions, however, are more common.
Some show a distinct line of cleavage between the novice 
and the apprentice. Novices fail, but apprentices, journey- 
men and experts alike answer correctly. There are likewise 
questions that are answered correctly by nearly all journey- 
men and experts but only a few apprentices, and questions 
that only an expert can answer correctly. Each type of 
questions has its value in a good test. The main require- 
ment is that the tendency of the curve should be upward; a 
question which is answered correctly by more journeymen 
than experts or more apprentices than journeymen is unde- 
sirable and is at once discarded. A proper balance is made 
of the others. 

"Calibrating the Test. — One task still remains; namely, 
that of calibrating the test. As each question is allowed four 
points, it becomes necessary to determine how many points 
should indicate an expert, how many a journeyman, etc. 
Obviously the way to do this is to note how many points 
were scored by the known experts and the known journey- 
men when they were tested. Ordinarily the expert scores 
higher than the journeyman and the journeyman higher 
than the apprentice. It frequently happens that a few 
journeymen score as high as the lowest of the experts and a 
few apprentices as high as the lowest of the journeymen. 
There are consequently certain overlappings between the 
classes. In calibrating, the object is to draw the dividing 
line between classes so that the overlapping shall be as small 
as possible. 



"When these dividing lines, or critical scores as they are 
usually called, are established, the test is ready for distri- 
bution to camps." 
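
The evaluation and calibration steps described in the quotation lend themselves to a short sketch. The code below is an illustrative reading of those two steps, not the procedure actually used by the Central Trade Test Office: the function names, the 0/1 marking of answers, and the rule of counting misplaced men to measure overlap are all assumptions made for the example.

```python
GRADES = ["novice", "apprentice", "journeyman", "expert"]


def proportion_correct(answers_by_grade):
    """answers_by_grade maps each grade to a list of 0/1 marks on ONE question.
    Returns the share of correct answers per grade, in ascending grade order."""
    return [sum(answers_by_grade[g]) / len(answers_by_grade[g]) for g in GRADES]


def is_upward(answers_by_grade):
    """True when no grade answers the question better than the grade above it.
    Questions failing this check are the ones the quotation says to discard."""
    p = proportion_correct(answers_by_grade)
    return all(lower <= higher for lower, higher in zip(p, p[1:]))


def critical_score(lower_scores, upper_scores):
    """Choose the dividing score between two adjacent grades.

    A man scoring at or above the returned value is placed in the upper grade.
    Assumed measure of overlap: the number of known men who would be misplaced.
    """
    candidates = sorted(set(lower_scores) | set(upper_scores))

    def misplaced(cut):
        too_high = sum(s >= cut for s in lower_scores)
        too_low = sum(s < cut for s in upper_scores)
        return too_high + too_low

    return min(candidates, key=misplaced)
```

Applying `critical_score` once for each pair of adjacent grades would give the set of critical scores for a test. When several cut points misplace the same number of men, this sketch simply returns the lowest of them.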

Suppose that we give a group of pupils a test in arith- 
metical problems, and then, without arousing the suspicion 
of the pupils, arrange the situation so that these same pupils 
will meet these same arithmetical problems in their play life 
on the street, and suppose that the test and the observations 
upon the pupils' success with the play problems are reliable 
measures of each of these abilities and suppose, finally, that 
the correlation between the test and the observations is of 
only average closeness, does this condemn the test as not 
being a measure of real ability? Assuming that proper ex- 
perimental precautions have been taken, this correlation cer- 
tainly tells us that the test problems are a rough but not an 
accurate measure of play problems. But before we con- 
demn the test we ought to correlate the pupils' scores on play 
problems with their scores on those same problems when 
shopping for their mothers or some other practical situation. 
It is not known, but it is very possible that the correlation 
between different real-life situations is no closer than be- 
tween the test and any one of these situations. In sum, it is 
even probable that there is no such thing as real ability, in 
the sense that we are discussing it, but that there are instead, 
many abilities differing somewhat one from another. It is 
hopeless to expect to find a test which will closely correlate 
with each of these life situations, wrapped about, as each is, 
with its own individuality or specificness. 

It might be possible to eliminate experimentally, all of the 
specificness belonging to our test and each life situation, and 
thus demonstrate a perfect correlation between all the thus 
purified abilities. Such an analysis of abilities would be of 
considerable theoretical interest and, as we saw in our dis- 
cussion of diagnosis, would be of great value in connection 
with remedial instruction. But for the purpose of prophesy- 
ing success in life and the like, we cannot deal with these 
rarefied abstractions of abilities, for abilities must always
function through specific situations. It would have been of 
no comfort to Ruml and his colleagues to know that their 
Trade Tests, when experimentally purified, correlated per- 
fectly with similarly purified trade situations. They were 
asked to construct tests which would, with the least error, 
select men who could make good in a variety of specific situ-
ations. It is no condemnation of an educational test if it
shows only substantial correlation with a variety of real sit- 
uations. It is a condemnation when it shows little or no 
correlation with real abilities or when it shows less correla- 
tion with such abilities than some other available test which 
is equal in all other requirements. 

II. Validation of an Intelligence Test 

Criterion of Intelligence. — How may we determine 
whether a particular test yields a valid measure of intelli- 
gence? As was stated a few pages back, what a test really 
measures is known only by its correlations. The closeness 
of a test's correlation with what constitutes intelligence is a 
measure of its excellence as an intelligence test. This is the 
single ultimate method of choosing such a test. If this con- 
dition is satisfied the examiner need look no further. In 
order to determine a test's worth, it is necessary to have a 
number of pupils accurately rated for intelligence. Tests 
can then be correlated with this rating. 
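
As a rough illustration of this procedure, the sketch below correlates several candidate tests with an intelligence rating and keeps the one that agrees most closely. It assumes the product-moment (Pearson) coefficient as the measure of correlation; the function names and the dictionary layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_test(candidate_scores, ratings):
    """candidate_scores maps a test name to the pupils' scores on that test,
    listed in the same pupil order as the ratings.
    Returns the name of the test correlating most closely with the ratings."""
    return max(candidate_scores,
               key=lambda name: pearson(candidate_scores[name], ratings))
```

The choice is only as good as the ratings themselves, which is why the accuracy of the criterion rating matters so much.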

This intelligence rating is usually secured by having the 
most competent obtainable persons observe the objective 
behavior of the pupils and, allowing for non-intelligence fac- 
tors, estimate the degree of intelligence indicated by this 
objective behavior. There is no reason why these observa- 
tions should not extend over a life time and then be checked 
by historical verdict. By means of a rating so obtained, 
we can, by the trial-and-error process, find a test which 
gives results which closely agree with the observations of 
competent observers. Once one test has been carefully vali- 
dated in this fashion, other tests can be constructed and cor- 
related with the original test. 



If social judgment is the final criterion of intelligence, why 
not employ it exclusively? Tests are resorted to not only 
because they are far more economical but particularly be- 
cause they are impersonal and prophetic. History has 
changed too many pillories to monuments, and parental eval- 
uations of children have been too frequently reversed for us 
not to know that subjective judgment often tends to be 
prejudiced unless the observations are safeguarded with un-
usual care. For this reason the relatively ice-cold, mob-
proof, carefully validated intelligence tests are coming to be
used more and more for intellectual determination. Intel-
ligence tests possess the further incalculable value of being
prophetic. Tests do not wait until success is achieved before
passing verdict. They do not lay roses on the grave of a
genius, but crown him in childhood.

Must intelligence tests always be searched for in the ex- 
ceedingly wasteful trial-and-error fashion? Cannot the 
psychological analysis of intelligence be carried far enough 
to at least indicate the general direction for test construc- 
tion? It can, but psychological analysis is no substitute for 
experimental evidence of validity. For illustrative purposes 
such a tentative analysis, in terms of the neural mechanism, 
is outlined below. 

Analysis of Intelligence. — 1. The number of desirable
neural connections. — One of psychology's important criteria
of superior or inferior intelligence is the difference in the
ability for minute analysis, "piece-meal activity," or to deal
with subtle elements of a situation. In neural terms this
means that the more intelligent individual has more neural
connections for any one situation. To a stupid individual a 
ripe peach will probably suggest gastronomic satisfaction 
only, while to the more intelligent it suggests this to be sure, 
but it may also suggest the flush of dawn, the blush of a 
maid, the softness of a baby's cheek, or the fruit of the Tree 
of Knowledge! The flower in the crannied wall was more 
to Tennyson than a pretty weed to adorn a vain buttonhole!
To those with numerous neural connections "every chip
sprouts wings to bear a god" and falling apples cause a flow 
of ideas as well as a flow of saliva! To a Woodberry, "a 
rose shadows us with Persia, or a single lotus blossom un- 
bosoms all the Nile." 

2. The system of organization of connections. — Mental 
"short cuts" and "hierarchies" are essential for effectiveness 
in novel thinking and highly skilled action. And a hier- 
archy is simply a very complex coordination of neural con- 
nections. System, coordination, organization are as im- 
portant in the mental sphere as they are in a telephone 
system or in any social, business, or industrial sphere. 
While educational experts disagree as to whether the unit of 
organization should be projects or general principles or some- 
thing else, all agree that education should result in neural 
organizations around some units. Two children with an 
equal number of neural connections might vary enormously 
in intelligence, due to differences in the system of organiza- 
tion of their connections caused either by heredity or experi- 
ence. 

3. The ease of forming and breaking neural connections. 
— Intelligence appears to be highly correlated with plasticity, 
and plasticity applies equally to the forming or breaking of 
a connection. It seems to be a general characteristic of 
stupid individuals that they form connections or learn very 
slowly, but when a habit of acting or thinking has been estab- 
lished by dint of much effort, they do not readily relinquish 
or modify it to meet novel or changing conditions. This 
barnacle-like clinging to old habits should not be mistaken 
for a good memory. Quick forming and rapid modifications 
of neural connections are the undoubted attributes of 
superior intelligence, but quick modification occurs only 
when experience demonstrates that an old neural connection 
inadequately adapts the individual to his environment. 

4. Permanence of desirable neural connections. — It has 
just been pointed out that intelligence is closely correlated 
with ease of forming and breaking neural connections. 
Great permanence of desirable connections is also an index of
high intelligence. No great growth of intelligence could 
occur if each night of sleep wiped out the learning of the 
day as each night in Valhalla healed the wounds of the day's 
battle. There is a popular belief, fallaciously transferred 
from bank accounts to individuals' memories that "easy
comes, easy goes," and hence superior intelligence cannot 
possess both superior plasticity and superior permanence. 
Not only "to him that hath shall be given" but to him that
hath has been given. For the adage that he who learns 
quickly forgets quickly is not based on facts but upon a sym- 
pathetic desire to comfort stupid folks. 

This somewhat theoretical analysis is followed by a set 
of more tangible working principles. Most of these princi- 
ples have consciously or unconsciously guided those who 
have constructed our basic intelligence tests. 

Suggestions for Intelligence Test Construction. — 
1. An intelligence test should be a learning test which ex-
tends backward rather than forward. — These four — the
number of desirable neural connections, the organization of 
these connections, the ease of forming and breaking connec- 
tions, and the permanence of connections — are the chief 
characteristics of an individual which a test must measure 
if it is to be a good intelligence test. Two methods have 
been proposed for testing these pre-eminently valuable char- 
acteristics. One method is to confront a pupil with a learn- 
ing situation which varies from very simple to very complex. 
A measurement of the number of points learned, the maxi- 
mum complexity of the thing that could be learned, the rate 
of learning, and the persistence of the things learned, would 
give a measure of the pupil's four prime characteristics. 
The inherent difficulties in conducting such learning tests are 
so great that another testing method is in almost exclusive 
use. This method takes samplings from the abilities which 
a pupil has, during his whole life, succeeded in developing. 
While this is also a learning test method, the learning test 
extends all the way from the present back to birth rather 
than from the present to a brief future. 



A single type of test will measure all four neural charac- 
teristics. Imagine two pupils of like chronological age. 
The pupil who forms and breaks neural connections more 
easily and whose connections are more permanent, will have 
the greatest number of neural connections. Hence, to com- 
pare the intelligence of the two pupils, all that is necessary is 
to make a determination of the relative number of their 
neural connections or mental abilities and of the efficiency of 
organization of these abilities or connections. 

2. An intelligence test should measure the largest pos- 
sible number of traits. — While it is probably true that every 
mental trait or set of neural connections boasts no aristo- 
cratic exclusiveness, but exemplifies even Nature's predilec- 
tion for democracy by partially combining with other traits 
to constitute a coordinating neural hierarchy, nevertheless, 
every trait retains a portion of its individuality or exclusive- 
ness! For this reason, the larger the number of traits 
measured, the safer the diagnosis. A test which measured 
but a few traits might happen to strike just those mental 
functions in which the pupil, for some accidental reason, was 
specially strong or specially wanting. The assayer takes 
many samples from many points in the ore bed. 

3. An intelligence test should measure samplings from
the relatively more differentiating traits. — The ideal way is 
to measure every trait that contributes to intelligence, attach 
weights to the various traits according to the amount of their 
contribution to intelligence and add. The resulting sum 
would be a perfect measure of intelligence. Even if we 
knew how to test every trait and just what weights to attach 
to each, time would compel us to confine our attention to the 
more differentiating traits. 
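
The ideal procedure just described, namely to measure, weight, and add, amounts to a weighted sum. The sketch below shows only the arithmetic; the trait names and weights in any call are invented for illustration, and nothing here says how the weights ought to be found.

```python
def composite(trait_scores, weights):
    """Weighted sum of trait scores.

    trait_scores and weights both map a trait name to a number.
    Only traits that carry a weight enter the sum.
    """
    return sum(weights[trait] * trait_scores[trait] for trait in weights)
```

Restricting the `weights` mapping to a handful of traits corresponds to confining attention to the more differentiating ones.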

But how may we know what are the differentiating traits? 
Let us proceed by the process of elimination. We can elim- 
inate those traits in which man is little or not at all superior 
to the animals. Certain elemental functions such as keen- 
ness of vision, hearing, and smell, speed of simple muscular 
responses like running and tapping and the excellence of the
neural functioning in connection with breathing, digesting 
and other organic functions, all these are of great importance. 
There are more failures from indigestion than this world 
dreams of. After a certain minimum these traits have little 
differentiating value. In them the brute and the stupid 
human are about the same as the genius. The intelligence 
tester will do well to steer clear of simple sensori-motor tests 
and seek out those traits which chiefly distinguish man from 
the brute, and genius from stupidity. Simple observation 
of these distinctions will point the way toward differentiating 
traits. 

Both observations and correlations with semi-satisfactory 
estimates of intelligence have indicated that intelligence tests 
should measure such traits as the ability to analyze a com- 
plicated situation, to attend to many elements at one time, 
to easily and effectively shift from one mental set to another, 
to deal with abstract symbols and relationships and the like. 
I ¹ found that complex, difficulty tests show a closer correla-
tion with intelligence than simple speed tests such as cancel-
ling figures or letters, adding, copying and the like. The 
preceding psychological analysis would lead to just such a 
conclusion because it is the complex difficulty which indicates 
the efficiency of organization of a pupil's neural connections. 

It is not sufficient to select and test the differentiating 
traits. At best the traits tested will vary in differentiating 
power, or said differently, will vary in the amount of their 
contribution to intelligence. Thorndike ² has pointed out
that refined procedure requires each trait to be weighted 
according to its contribution to intelligence. 

¹ Wm. A. McCall, "Correlations of Some Psychological and Educational Measure-
ments with Special Attention to the Measurement of Mental Ability"; Teachers
College, Columbia University Contributions to Education, No. 79.

² For an excellent discussion of how to weight traits see E. L. Thorndike,
"Fundamental Theorems in Judging Men"; The Journal of Applied Psychology,
March, 1918.

Theoretically the weight attached will often differ not only
from trait to trait, but will differ for different amounts of one
trait. Just as one trait may contribute more to intelligence
than another, so the same trait may not contribute anything
until it is present in considerable amount, after which each 
increase may contribute a proportional amount of intelli- 
gence until a certain point is reached beyond which further 
increases in the trait are attended by a decreased contribu- 
tion. When a man is seeking a wife, intelligence itself be- 
comes a trait contributing to wife-likelihood for any woman 
he meets. It is said that any increase in a woman's intelli- 
gence contributes nothing toward a man's desire to marry 
her until a certain minimum of intelligence is reached. Any 
increase in intelligence from this minimum to a certain max- 
imum, usually defined as on a par with the man's intelligence, 
makes her more desirable to him. But woe to the woman 
whose intelligence exceeds his maximum unless she is clever 
at camouflage! Beyond that maximum an inverse relation- 
ship obtains, or so it is said. Most of the scientific work 
that has been done with mental traits has, however, assumed 
a constant and not a changing relationship such as that of 
the foregoing illustration. Thorndike states in the above 
mentioned article: 

"This assumption of proportionality or rectilinearity in
the relation line is behind most of the scientific work that 
has been done with educational and vocational selections. 
The technique of partial correlation coefficients and the 
regression equation, for example, assumes approximate 
rectilinearity of the relation lines." 
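
The contrast between a constant and a changing relationship can be made concrete. The sketch below sets a straight-line contribution beside one possible changing relation of the kind described above: nothing below a minimum, a proportional gain up to a maximum, and a falling contribution beyond it. The piecewise form, the parameter names, and the rates are assumptions chosen for illustration only.

```python
def rectilinear_contribution(amount, weight=1.0):
    """The usual assumption: each unit of the trait adds the same amount."""
    return weight * amount


def changing_contribution(amount, minimum, maximum, gain=1.0, loss=1.0):
    """A hypothetical non-rectilinear relation line.

    Below `minimum` the trait contributes nothing; between `minimum` and
    `maximum` each unit adds `gain`; beyond `maximum` each further unit
    takes away `loss` from the peak contribution.
    """
    if amount <= minimum:
        return 0.0
    if amount <= maximum:
        return gain * (amount - minimum)
    peak = gain * (maximum - minimum)
    return peak - loss * (amount - maximum)
```

A single straight line fitted to data that really follow the second function would understate the contribution in the middle range and overstate it at both ends.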

In attaching weights to traits the examiner should be 
specially careful lest he weight the same trait more than once. 
Suppose that we are assigned the task of selecting promising 
salesmen for life insurance. Suppose that among other 
traits we have tested language ability and mathematical 
ability, and that we wish to assign appropriate weights to 
each ability. If, due to complicated test instructions or 
roundabout descriptions of problems, the chief difficulty in 
the so-called mathematical tests were really language diffi- 
culty and not mathematical difficulty, assigning weights to 
the two tests would thus really mean the assignment of a
double and hence an undue weight to language ability. This 
error of weighting the same trait more than once is very com- 
mon. The error can be avoided only by a knowledge of 
each test's correlations. Even though two tests have differ- 
ent names, if the correlation between them is very close, we 
can be pretty sure that the difference is in name only. To 
quote from the same article by Thorndike: 

"Do not weight the same contributing element twice be-
cause it appears in two or more traits. Or, more adequately:
Attach weights to elements according to the amount of their 
contributions, irrespective of the number of symptoms in 
which they appear. A full determination of independent 
elements and their contributions, whether experimental or 
by the method of partial correlation coefficients, is very 
difficult, and has never been made in any case to my knowl- 
edge. A complete determination is indeed not necessary for 
fairly efficient prophecy. Where the best possible weighting 
would prophesy fitness with say, a resemblance to demon- 
strated fitness of .95, a prophecy to the extent of .90 will 
commonly be reached by such a rough weighting as a com- 
petent thinker can devise in an hour or two upon inspection 
of the relevant correlations. So for practical purposes, we 
may translate the theorem into: Attach weights to traits 
only after a knowledge of their correlations."

The examiner's object should be to test traits which are 
each closely correlated with intelligence, but not closely cor- 
related among themselves. Only thus is it possible to avoid 
a lop-sided view of intelligence. 
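
One simple safeguard against weighting the same trait twice is to inspect the correlations among the tests before any weights are assigned. The sketch below flags pairs of tests that correlate very closely with each other. It assumes the product-moment coefficient, and the threshold of .90 is an arbitrary figure chosen for the example, not a value given in the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(scores, threshold=0.90):
    """scores maps a test name to the scores of the same persons, in one order.
    Returns the pairs of tests whose intercorrelation reaches the threshold,
    that is, pairs which may differ in name only."""
    return [(a, b)
            for a, b in combinations(sorted(scores), 2)
            if pearson(scores[a], scores[b]) >= threshold]
```

A flagged pair is a warning and not a verdict: the examiner would still have to decide which of the two tests to keep, or how to divide one weight between them.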

Weights should be attached to a trait not only in the light 
of its contribution to intelligence, and in the light of varia- 
tion in contribution with variation in the trait's amount and 
in the number of symptoms in which the trait appears, but 
also in the light of the trait's influence as affected by the 
amount of presence or absence of other traits upon which the 
trait in question depends. To illustrate, an increase in intel- 
ligence does not cause an increase in demand for one's 
services as a bank teller if one is dishonest. But let there
be an increase in the trait honesty, and then increases in 
intelligence otherwise valueless become valuable. With the 
explanation that Thorndike is talking about a man's ability 
for a job rather than a pupil's intelligence we cannot do 
better than let him state this point and summarize the whole 
discussion. (It should be said that these principles apply 
to the measurement of any ability which is broad enough to 
be dependent upon subordinate traits.) 
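
The bank-teller illustration can be put in arithmetic form. In the sketch below one function adds the two traits and the other lets honesty act as a multiplier of intelligence. The multiplicative form is merely one assumed way of expressing the dependence; the function names and the scales of the traits are invented.

```python
def additive_value(intelligence, honesty, w_intelligence=1.0, w_honesty=1.0):
    """Each trait contributes independently of the other."""
    return w_intelligence * intelligence + w_honesty * honesty


def dependent_value(intelligence, honesty):
    """Honesty acts as a multiplier: with honesty at zero, no amount of
    intelligence adds anything to the teller's value."""
    return honesty * intelligence
```

Under the additive form a dishonest but clever teller still receives credit for his cleverness, which is exactly what the illustration denies.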

"The dependence of the amount of influence of a trait
upon the amount of some other trait possessed by the man 
will often be more complex than that described in the illus- 
tration. For example, the value of technical knowledge and 
skill for a certain job might vary as the square root of the 
man's intellect or as the square of his intellect. There may 
also be many sorts of irregular relations in these depen- 
dencies of one trait's influence upon the amount of another 
trait that is possessed. 

"I am not aware that any scientific investigation has
measured any of these dependencies. There is, however, 
good reason to believe that they exist and are important. 
A man's possession of what we call energy, for example, 
seems to be a multiplier for his intellect or skill. A man's 
loyalty or devotion in any particular job seems to be a multi- 
plier of his other equipment for it. What we call roughly 
interest in success, or determination to succeed, or ambition, 
seems to be a multiplier for energy. And in theory, at least, 
we must admit it is a fundamental theorem that prophecy
of the degree of influence of any amount of a trait requires 
consideration of the amount of every other trait in the man 
in question. 

"There are other principles concerning the use of facts 
in the selection of men which must be followed to make the 
best possible prophecy, but limitations of time prevent their 
discussion now. In the time that remains I may simply 
summarize our findings roughly and compare the scientific 
with the impressionistic or intuitional use of facts about a 
man.

"We have seen that the status of a man in the traits relevant
to fitness for a job may be expressed in an equation:
John Doe = 7a + 9b + 4c + 21d . . . . A prophecy of
his fitness relative to other men is obtained by attaching
weights to a, b, c, d, etc., in view of (1) their relations to
fitness, (2) their partial constitution by common elements, 
and (3) any dependencies whereby one gains or loses in 
influence according to the amounts of the others which are 
present. The setting up of an equation of prophecy from an 
equation of status will usually be very complex, but a rough 
approximation, if sound in principle, will often give excel- 
lent results. In so far as the lines of relation, interrelation, 
and dependency are rectilinear, the technique is greatly 
simplified; and a rough approximation to this is probably 
often the case. 

"Even when approximations are used and when, by good 
fortune, many of the relations concerned are rectilinear, 
sound procedure will still be more elaborate and difficult 
probably than has ever yet obtained in practice. Practice is 
thus justified to some extent in judging fitness experimen- 
tally by tests with dummy operations like those of the job 
in question, as well as analytically by a weighted combination
of contributing elements.

"All this study of relation lines, intercorrelations, facilitations
and inhibitions and resulting weights by multiplying
and adding, represents a scientific execution of just what 
the competent impressionistic or intuitional judge of men 
tries to do. The strength of such intuitional judgments in 
comparison with the formal systems of credits and penalties 
of the past has been, not that the intuition was less quantitative,
but that it was more so! The formal systems of the
past have used symptoms merely additively and often with
only an 'all or none' credit. They have not allowed for the
undue duplication of credits by intercorrelations, and have
not sensed the importance of the multiplying effect of cer- 
tain traits upon others. The competent impressionistic 
judge of men does respond to these interrelations of the facts 
and sums up in his estimate a consideration of each in the 
light of the others. If there are ten traits involved, say ten 
entries on an application blank, he may be said to determine 
his prophecy by at least 10 + 9 + 8 + 7 + 6 + 5 + 4 +
3 + 2 + 1 quantities, since he responds to each trait in relation
to all the others. There is a prevalent myth that the
expert judge of men succeeds by some mystery of divina- 
tion. Of course, this is nonsense. He succeeds because he 
makes smaller errors in the facts or in the way he weights 
them. Sufficient insight and investigation should enable us 
to secure all the advantages of the impressionistic judgment 
(except its speed and convenience) without any of its 
defects." 
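
The arithmetic of the passage just quoted can be made concrete. The following Python sketch is an illustration only, under stated assumptions: the trait names, weights, and amounts are hypothetical, and treating honesty as a simple multiplier is one reading of the bank-teller example, not a formula given by Thorndike. It shows the count of 10 + 9 + ... + 1 quantities, and the difference between a merely additive system of credits and one in which a trait multiplies the value of the others.

```python
def quantity_count(n_traits):
    """Count n + (n - 1) + ... + 1, as in the ten-trait illustration."""
    return sum(range(1, n_traits + 1))


def additive_fitness(amounts, weights):
    """Merely additive system: a weighted sum of the trait amounts."""
    return sum(weights[trait] * amounts[trait] for trait in weights)


def fitness_with_multiplier(amounts, weights, multiplier_trait):
    """Weighted sum scaled by one trait that acts as a multiplier."""
    return amounts[multiplier_trait] * additive_fitness(amounts, weights)


# Hypothetical weights and amounts, chosen only for illustration.
weights = {"intellect": 2.0, "skill": 1.0}
honest_teller = {"intellect": 5, "skill": 3, "honesty": 1.0}
dishonest_teller = {"intellect": 9, "skill": 3, "honesty": 0.0}

print(quantity_count(10))                                             # 55
print(additive_fitness(dishonest_teller, weights))                    # 21.0
print(fitness_with_multiplier(honest_teller, weights, "honesty"))     # 13.0
print(fitness_with_multiplier(dishonest_teller, weights, "honesty"))  # 0.0
```

Under these assumed figures the additive system credits the dishonest teller with the higher score, while the multiplying system reduces his score to zero, which is the point of the illustration.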

4. An intelligence test should measure only those traits 
which every pupil has an equal opportunity to develop. —
This means that the test material and methods of the test 
should be drawn from the social medium common to all 
children. Theoretically there should be equal opportunity 
to learn the test material, but practically about all that can 
be provided for is ample opportunity. Those traits should 
be measured which are least influenced by such differential 
agencies as school vs. non-school training, city vs. rural life,
masculinity vs. femininity, luxury vs. poverty, etc. A coun- 
try boy might easily show up unfavorably in comparison with 
his city cousin in reacting to questions about elevators, sky- 
scrapers, subways and roller-skates, while the situation 
might be reversed if the questions dealt with hay-mows, disc- 
harrows, silos, dibbles, copperheads, and yellow jackets. 
Many of our pedagogical tests are increasing in value as 
intelligence tests because schooling is coming more and more 
to be a common environment for all children. 

5. An intelligence test should measure those traits growth 
in which is most universally motivated. — There are numer- 
ous situations in which all children move and have their 
being and which, according to a liberal definition of environ- 
ment, are a common environment to all children. Yet many 
of these situations make little or no appeal to a child unless 
he has an erratic nature or is in some way extraneously
motivated. 

If a test incorporated such situations it would unfairly 
favor children with peculiar interests or peculiar training. 
The best guarantee of universal motivation is to select for
test elements, activities in which the largest number of children
are instinctively interested. The accumulating records
of observations on what activities are self-chosen by the 
children in the kindergarten and primary grades offer a 
rich mine of suggestions to makers of intelligence tests. 

6. An intelligence test should show a higher per cent of 
correct responses with each increase in chronological age. — 
This principle holds up to the age when intelligence matures, 
which is supposed to be not far from 18 years of age. The 
fundamental assumptions underlying the intelligence test as 
customarily used are that the total amount of knowledge, 
skill and power acquired by an individual (a) is a measure 
of his present intelligence, (b) is proportional to his inherited 
intelligence, and (c) is prophetic of his future intelligence. 
These three points mean that if one infant has a native en- 
dowment twice that of another child, he will develop propor- 
tionately faster until intelligence matures and hence will at 
every stage of his life be proportionately superior to the 
originally inferior individual. 

7. An intelligence test should measure the ability to
transfer training. — One of the great advantages of possessing
numerous neural connections which are effectively organized 
is that they guarantee wide-scale transfer. The genius 
makes everything grist which comes to his mill. He can 
transfer both Latin and Algebra to just about anything. 
The stupid individual, on the contrary, lacks this nimbleness 
of wit. He can be trained but can be educated only with 
difficulty. He would make a fairly good showing if the test 
contained material upon which he had had direct training. 
When the test presents tasks for which he has had no specific 
training his existing neural connections are unable to deal 
with the new situations. This difference between indi- 
viduals in their power to deal with situations for which they 
have had no specific training is so significant and marked 
that intelligence might well be defined as the power to trans- 
fer training. An intelligence test which does not measure 
this ability is certainly imperfect.

Current Methods of Measuring Intelligence. — If the 
reader will think of the following quotation from Terman *
as applying to pupils instead of to soldiers and adults, he 
will have an admirable statement of current methods of 
measuring the general intelligence of school children. 

"The Method of Trying Out. — One method would be to 
try out each soldier in tasks of various degrees of difficulty. 
The method is fairly sure. It could probably be depended 
upon to give us an efficient army within a few years. 

"So with the method of natural sifting. If the war 
should last long enough the best men would pretty certainly 
in time demonstrate their ability and rise to positions of re- 
sponsibility, while those of poorest mentality would gravitate 
to the humbler tasks. But the gravitational method of sift- 
ing meets many resistances, and like the method of trying 
out, is necessarily very slow. 

"Pseudo-scientific Methods of Rating Mentality. — A cen- 
tury ago a French physiologist, by the name of Gall, 
founded what he thought was a new science, which was 
named phrenology. According to phrenology, definite and 
constant relations were believed to exist between certain 
mental traits and the contour of the head. It was believed, 
for example, that one's endowment in such traits as intelli- 
gence, combativeness, sympathy, tenderness, honesty, re- 
ligious fervor, and courage, could be judged by the promi- 
nence of various parts of the skull. 

"It is unnecessary to question the sincerity of Gall and 
his over-enthusiastic followers. They were probably not 
guilty of conscious deception, but merely blinded by an 
attractive theory. At any rate, the 'science' of phrenology 
has been hopelessly exploded. It has been well demonstrated:

"1. That traits like those mentioned above do not have
separate and well-defined seats in the brain, and 

"2. That skull contour is not a reliable index of the brain 
development beneath.

"In the underworld of pseudo-science, however, phrenology
and kindred fakes still survive. Hundreds of men
and women still earn their living by 'feeling bumps on the
head,' reading character from the lines of the hand, etc.
But modern warfare has no time for pseudo-science. A general
would no more think of selecting his officers by
phrenological methods than of substituting incantations for
gunpowder.

* Lewis M. Terman, "Tests of General Intelligence," Psychological Bulletin,
May, 1918.

"The Method of Off-hand Judgment. — But if in the rating
of men pseudo-science is misleading, perhaps science is still 
unnecessary. It may be argued that mental traits can be 
rated accurately enough for all practical purposes on the 
basis of ordinary observation of one's behavior, speech and 
appearance. We are constantly judging people by this off- 
hand method, because we are compelled to do so. Conse- 
quently we all acquire a certain facility in handling the 
method. For ordinary purposes it is infinitely better than 
nothing. A skillful observer can estimate roughly the height 
of an aeroplane; but if we would know its real height we 
must use the methods of science and perform a mathemati- 
cal computation. 

"The trouble with the observational method is its lack of
a universal standard of judgment. One observer may use 
a high, another a low standard of comparison. A four-story 
building in the midst of New York's 'sky-scrapers' looks 
very low; placed in the midst of a wide expanse of one-story 
structures it would look very tall. The captain of a very 
superior company may rate his least intelligent man as 'very 
dull'; the same man in a very inferior company would likely 
be rated as 'average' or better. 

"Moreover, we are easily misled by appearances. The 
writer knows a young man who looks so foolish that he is 
often mistaken by casual acquaintances for a mental defec- 
tive. In reality he is one of the half dozen brightest students 
in a large university. Another man who in reality has the 
mentality of a ten-year-old child, is so intelligent looking 
that he was able to secure employment as a city policeman. 

"Language is a great deceiver. The fluent talker is likely 
to be over-rated, the person of stumbling or monosyllabic 
speech to be underrated. Similar errors are made in judging
the intelligence of the sprightly and the stolid, the
aggressive and the timid, etc. Our tendency is also to overestimate
the intellectual quality of our friends and to underestimate
that of persons we do not like.

"If the method of off-hand judgment were reliable,
different judges would agree in their ratings of the same 
individual. When the judges disagree it is evident that not 
all can be correct. When intelligence is rated in this way 
wide differences of opinion invariably appear. Twenty-five 
members of a university class who had worked together 
intimately for a year were asked to rank the individuals 
of the class from 1 to 25 in order of intelligence. The result
was surprising. Almost every member of the class was
rated among the brightest by someone, and almost every 
member of the class among the dullest by someone. Doubt- 
less the judges were misled by all kinds of irrelevant 
matters, such as personal appearance, fluency of speech, 
positiveness of manner, personal likes and dislikes, etc. 
Think how much error there would be if a company com- 
mander were rating the intelligence of 250 men newly 
assigned to his command. 

"The method of personal estimate is much better than
the method of external signs (phrenology), but to be reli- 
able it must be supplemented by a method which is objec- 
tive, that is, a method which is not influenced by the personal 
bias of the judge or by such irrelevant factors as appear- 
ance, speech, or bearing of the one to be rated. Such is the 
method of intelligence tests. 

"Intelligence Tests a Method of Assaying Mentality. —
A man wishes to find out the value of a gold bearing vein 
of quartz. How shall he set about it? One way would be 
to uncover all the ore and extract every ounce of gold 
contained in it. It is hardly necessary to point out that this 
would be a slow and risky procedure, one that might easily 
cost a fortune and bring small returns. But granting that 
the extent of the quartz vein was known and that the cost 
of bringing it to the surface could be calculated, would this 
be sufficient to tell us the value of the mine? The answer 
is obvious; something depends on whether the quartz con- 
tains many dollars' worth of gold or only a few pennies' 
worth, per ton of ore.

"However, the next step is easy. It is only necessary 
to take a few random samples of the ore to an assayer, who 
makes a simple test and returns the verdict of so many 
ounces of gold per ton of rock. The verdict of the assayer 
may justify the expenditure of a million dollars or it may 
tell us the mine is not worth a penny. At any rate the 
question of value is answered. 

"Suppose the question before us is not the value of a gold 
bearing vein of quartz, but the intellectual quality of a 
human mind. If we are to rate the quality of a man's in- 
telligence will it be necessary to make this intellect perform 
every act of which it is capable in order that these may be 
added together for a total intelligence rating? This would 
be one method of answering the question, but a rather 
tedious one, considering the innumerable acts which a 
human mind is able to perform. Perhaps this is not neces- 
sary. Conceivably it might be possible to sink shafts, as 
it were, at certain critical points, and by examining a few 
samples of the mind's intellectual product to estimate its 
intrinsic quality by a method analogous to that of the
assayer. 

"Such is the method employed in all systems of testing 
intelligence. The mind is given a number of 'stunts' to
perform, each of which requires the exercise of intelligence. 
By the quality of these the quality of the entire mind is 
judged. The tests tell us whether the mind in question is 
one of rich content and rare intellectual power, or whether
it is mediocre or perhaps even defective. 

"Collecting Samples for Assaying. — In ascertaining the 
value of the gold deposit would it be safe to take all the 
assayer's samples from a single part of the quartz vein? 
Common sense would of course suggest the precaution of 
taking samples from many places and of estimating the gold 
content in terms of average richness. Similarly in testing 
intelligence the subject is not asked to perform one intel- 
lectual 'stunt,' but many. He may be given tests of memory, 
of language comprehension, of vocabulary, of orientation 
in time and space, of ability to follow directions, of knowl- 
edge about familiar things, of judgment, of ability to find 
likeness and differences between common objects, of arithmetical
reasoning, of resourcefulness and ingenuity in
practical situations, of ability to detect the nonsense in 
absurd statements, of speed and richness of mental associa- 
tions, of power to combine related ideas into a logical whole, 
of ability to generalize from particulars, etc. The average 
of a large number of performances thus gives a kind of 
composite picture of the subject's general intelligence."



CHAPTER VIII 

ORGANIZATION OF TEST MATERIAL AND 
PREPARATION OF INSTRUCTIONS 

I. Factors Which Should Influence Arrangement 

Test Forms. — An important requirement in test con- 
struction is that the scoring be as economical, accurate and 
objective as possible. There are two ways of meeting these 
requirements. The first way is the proper construction of 
the test, the second way is the proper construction of scoring 
devices. 

In test construction the prime requisite from the point of 
view of scoring is that those pupil reactions to the test which 
are to be scored be as simple, abbreviated and controlled as 
possible, and that the reactions have a definite spatial loca- 
tion. With the exercise of some ingenuity a pupil's most 
complicated mental processes can be measured even when he 
reacts to each test element with no more than a word, a 
letter, a check, a number or the like. The excellence of the 
pupil's solution of a long reasoning problem in arithmetic 
can be condensed into a few figures — the answer. If the 
pupil's reactions are simple and abbreviated they can be 
scored very rapidly and accurately, and with very little dis- 
agreement among the scorers. 

Again, a test must also so control these reactions that only 
one kind of simple reaction will be correct. If any one of 
ten different words, or letters or numbers is correct, scoring 
will be greatly slowed up and judgment must be more and 
more exercised and the net result is uneconomical, inaccu- 
rate, and subjective scoring. If only one reaction is correct 
for a given test element it is possible to make out a set of 
correct answers. These correct answers may be placed
beside a pupil's answers, and then scoring becomes merely a
matter of making simple, unthinking, visual comparisons. 
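
The comparison of a column of pupil answers with a column of correct answers can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from any published test: the function name, the sample key, and the sample answers are all hypothetical. It assumes that each test element admits exactly one correct reaction, as the paragraph above requires.

```python
def score_against_key(answers, key):
    """Compare answers with the key, item by item.

    Returns the number right and the item numbers (counting from 1)
    on which the pupil's answer differs from the key.
    """
    if len(answers) != len(key):
        raise ValueError("answer sheet and key must have the same length")
    wrong = [item for item, (given, correct)
             in enumerate(zip(answers, key), start=1)
             if given != correct]
    return len(key) - len(wrong), wrong


# Hypothetical key and pupil answers for a four-item test.
key = ["2", "b", "7", "x"]
pupil = ["2", "c", "7", "x"]

print(score_against_key(pupil, key))  # (3, [2])
```

Because each comparison is a simple test of equality, no judgment is exercised by the scorer, which is what makes the scoring objective.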

Finally, the test must be so constructed as to give a defi- 
nite spatial location to a pupil's answers. In any case this 
is a decided convenience; it is particularly so when a pupil's
reactions all consist of a check mark or an underlining, where 
correctness depends not so much upon what is done as where 
it is done. Spatial location is secured by the provision of 
a square, circle, or other special place where the pupil is to 
make his mark. Consider, for example, how long it would 
require to announce the results of a presidential election 
if ballots did not spatially locate the voter's vote. 

The problem of constructing a test so that scoring will be 
efficient is shown by the following evolution of an extract 
from a test for military aviators which the writer aided 
Thorndike in constructing. (Instructions are omitted.) 
Note first that the nature of the test question permits a long, 
qualified, unscorable answer. Note second that there is no 
prescribed place where the answer must be written. This 
test element is a perfect illustration of what not to do. 

1. Compare the lines as they were before with what
they are now.

The test element is restated in better form below, though 
it is still inexcusable. Note that the nature of the test ele- 
ment encourages a briefer answer, and tends to control the 
type of answer. 

1. Are the lines shorter than they were before, longer
than they were before, or the same as they were before?

The test element is restated again in a still better form.
The aviators were instructed to write the appropriate number
in the parenthesis as I have done in the illustration.
Note that the answer is simple, abbreviated, controlled, and
located somewhat apart from the statement of the question.

1. Are the lines (1) shorter than they were before, (2)
longer than they were before, or (3) the same as they were
before? (2)

The above is the first form in which the question was 
actually stated. Note that a column of correct answers, 
properly spaced, could be placed beside a column of an 
aviator's answers in such a way that all errors could be 
detected with great accuracy and rapidity. 

But there is a decided defect in even the above form. 
Suppose the lines or trenches really are (2), i.e., longer than
they were before. For the aviator to report to the Intelli- 
gence Officer that the lines are shorter than they were before 
is to make a more serious mistake than if he were to report 
that they are the same as they were before. The former 
should be penalized, say, two points and the latter only one 
point. Consider how the following re-arrangement facili- 
tates the assignment of the proper amount of penalty. 

1. Are the lines (1) shorter than they were before, (2)
the same as they were before, or (3) longer than they were
before? (3)

Since in this case the lines really are longer than they were
before, the correct answer is 3. If 2
is found in the parenthesis, it should be penalized 1 point.
The difference between 3 and 2 is 1 point. If 1 is found in
the parenthesis, it should be penalized 2 points. The difference
between 3 and 1 is 2 points. Thus the test element
has been so constructed that the difference between the cor- 
rect number and the number appearing in the parenthesis 
gives instantly the proper amount of penalty. Without such 
simplification of scoring the extensive use of mental tests in 
the military service during the war would not have been pos- 
sible, nor would there be great promise for their future use 
in education. 
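
The penalty rule just described amounts to taking the difference between the correct number and the number found in the parenthesis. A minimal Python sketch follows; the function names are hypothetical, and the sketch assumes, as in the rearranged test element, that the alternatives are numbered in order from shorter (1) through the same (2) to longer (3).

```python
def penalty(correct, given):
    """Penalty in points: the distance between the correct and the given number."""
    return abs(correct - given)


def total_penalty(key, answers):
    """Sum of penalties over a column of answers scored against a key."""
    return sum(penalty(c, g) for c, g in zip(key, answers))


# The lines really are longer than before, so the correct answer is 3.
print(penalty(3, 3))  # 0
print(penalty(3, 2))  # 1
print(penalty(3, 1))  # 2
```

The rule works only because the alternatives are arranged so that neighboring numbers stand for neighboring judgments; with the earlier arrangement (shorter, longer, the same) the same subtraction would assign the wrong penalties.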

Below are extracts from a variety of tests, which illustrate 
how not only tests but ordinary examinations can be so constructed
as enormously to reduce the inaccuracy, subjectivity,
and time of scoring.

EXTRACT FROM RUGER'S PROVERBS TEST 

DIRECTIONS: In column No. 1 write opposite each English proverb
the number of the African proverb which most nearly means the same
thing as the English proverb (see below for African proverbs). (Do not
write any number twice — omit no number — write only one number opposite
each letter.)

Column 1     English Proverbs

........  a. First catch your hare.
........  b. Curses come home to roost.
........  c. Milk for babes.

African Proverbs 

1. Ashes fly in the face of him who throws them. 

2. I nearly killed the bird. No one can eat nearly in a stew. 

3. If the stomach is not strong, do not eat cockroaches. 



EXTRACT FROM THORNDIKE'S MENTAL ALERTNESS TEST 

Make a cross in the square before the best answer to each question.

1. Why are prunes a good food? Because they
   [ ] grow in California
   [ ] are wholesome and economical
   [ ] are served in boarding houses
   [ ] make an attractive dish

4. When you feel that affairs in your town are badly managed, should you?
   [ ] do nothing at all
   [ ] growl to your friends
   [ ] get out and work to have things changed
   [ ] go to church

EXTRACT FROM PRESSEY'S MENTAL SURVEY TEST¹

X. Analogies.

Examples:
   girl — woman : boy — man
   sun — day : moon ......
   good — bad : big ......

 1. woman — girl : man ......        11. hill — valley : high ......
 2. kitten — cat : puppy ......      12. arm — elbow : leg ......
 3. sky — blue : grass ......        13. truth — falsehood : straight line ......

¹ Issued by S. L. and L. W. Pressey, University of Indiana, Bloomington, Ind.

EXTRACT FROM GREENE'S ORGANIZATION TEST²

                                      Write numbers in these spaces

    (1)           (2)        (3)
  a dog,        a boy,       had.            ........

    (1)           (2)        (3)
  of the cold,  afraid,      they were.      ........

    (1)           (2)        (3)
  I am,         see,         how tall        ........



EXTRACT FROM OTIS' GROUP INTELLIGENCE SCALE³

Memory

DIRECTIONS: Read each question and if the right answer, according
to the story, is Yes draw a line under the word Yes. If the right answer
is No, draw a line under the word No. But if you do not know the right
answer, because the story didn't say, draw a line under the words Didn't
Say.

Sample:

   Was the story about a king?                      yes   no   didn't say
   Was the king's daughter sixteen years old?       yes   no   didn't say
   Was she ugly?                                    yes   no   didn't say

Begin here:

1. Was the king fond of hearing stories?           (yes   no   didn't say) 1.
2. Did the king offer his daughter to any one
   who could tell him a story that would
   last forever?                                   (yes   no   didn't say) 2.
3. Did he offer all his kingdom also?              (yes   no   didn't say) 3.
4. Did he say, "but if he fails he shall be cast
   into prison"?                                   (yes   no   didn't say) 4.



² Issued by S. A. Courtis, 82 Eliot Street, Detroit, Mich.

³ Copyrighted 1919 by World Book Company, Yonkers-on-Hudson, New York.
Used by permission of publishers.

Mechanical Scoring Devices. — Since scoring is
greatly facilitated by mechanical scoring devices and since
the possibility of employing such devices is dependent upon
the form of arrangement of the test material, a brief discussion
of these devices is pertinent at this point.

There are many forms of these mechanical devices depending
upon the form of the test which they are designed to
score. When all the pupils' answers are written at a definite
place on the right or left edge of the test sheet a convenient
device is a test sheet which has been correctly filled out by
the scorer. The key sheet can be so superimposed on the
pupil's sheet that nothing but the pupil's column of answers
shows immediately beside the correct answers. Such a scor- 
ing device can be used even when the pupils' answers are 
written between the lines. A test sheet is correctly filled 
and then all but the answers are cut away. The pupils' 
answers show through the resulting spaces, or the scoring 
sheet may be rolled and unrolled up and down the page. It 
is well to use a test sheet for the scoring device, because it 
quickly and automatically provides for proper spacing. If 
a more durable and less flexible form is required, the key 
answers may be placed on a card-board or even more durable 
material. 

Again, there are tests of such a nature that what the pupil 
does is relatively insignificant but where he does it is all- 
important. Such are tests where the pupil is instructed to 
underscore the appropriate word, or check the appropriate 
reason, or cancel the appropriate letter, etc. The scoring 
device already described may be used to advantage in this 
situation, but some form of transparent sheet frequently 
works better. Celluloid or any kind of transparent material 
may be placed over a correctly filled test, and a dot can be 
made on the celluloid sheet just over the place which is 
correct. The transparent sheet may then be used for scoring 
the pupils' answers. Otis makes an extensive use of just 
such a device for scoring his group intelligence test. 

Finally, if the test is so constructed that scoring will be 
facilitated by making all of the pupil's test sheet invisible 
except the spot where the correct answers should be, small 
apertures may be cut through a blank sheet at such places 
that only the correct-answer spots will be visible. The same 
result may be secured by placing a sheet of celluloid over a 
test sheet and by so painting the celluloid with black paint 
that nothing but the desired spots will be visible. These 
perforated scoring devices may also be used to facilitate the
counting separately of items mixed in one test. The Woody-McCall
Fundamentals of Arithmetic Test,⁴ Form I and
Form II, has addition, subtraction, multiplication, and division
examples so mingled on one test sheet that the pupil
is frequently forced to shift his processes, and often to decide
by the nature of the sign just what sort of an example it is.
In the instructions which accompany this test, I suggest that
the computation of a separate score for each fundamental,
if desired, may be facilitated by perforating four fresh test
sheets. The first sheet should be so perforated that when
placed over the pupil's test only addition examples are
visible. The second sheet should make visible only subtraction
examples; and multiplication and division should
be treated similarly.

⁴ Bureau of Publication, Teachers College, N. Y. C.
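
The four perforated sheets can be thought of as four lists of positions, each exposing only the examples of one fundamental. The Python sketch below is offered only as an illustration of that idea. The order of operations on the sheet and the pupil's marks are hypothetical and are not taken from the Woody-McCall test.

```python
def stencil_for(operations, wanted):
    """Positions left visible by the sheet perforated for one operation."""
    return [i for i, op in enumerate(operations) if op == wanted]


def score_through(marks, visible_positions):
    """Count the correct examples among those visible through the stencil."""
    return sum(1 for i in visible_positions if marks[i])


# Hypothetical sheet: the operation of each example, in printed order,
# and whether the pupil worked each example correctly.
operations = ["add", "sub", "mul", "add", "div", "sub", "mul", "div"]
marks = [True, False, True, True, False, True, True, True]

separate_scores = {
    op: score_through(marks, stencil_for(operations, op))
    for op in ("add", "sub", "mul", "div")
}
print(separate_scores)  # {'add': 2, 'sub': 1, 'mul': 2, 'div': 1}
```

The four separate scores sum to the total score on the mixed sheet, so nothing is lost by mingling the examples.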

Group vs. Individual Testing. — The nature of certain
tests and the illiteracy of young pupils has required individual
testing, i.e., the testing of one pupil at a time. The
nature of other tests and the literacy of older pupils permits
group testing, i.e., the testing of many pupils at once.

A heated controversy has been going on concerning the 
advantages and disadvantages of each method of testing, and 
this controversy continues in spite of the fact that skillful 
test constructors have now adapted almost all varieties of 
tests to permit group testing of illiterates. 

Even when group testing is feasible, it is claimed that a 
more accurate diagnosis can be made when each pupil is 
tested individually. This claim is based upon the assump- 
tion, first, that the appearances and incidental reactions of 
a pupil are valuable indices of his special defects or special 
strengths and that these indices are observed better during
an individual examination. The second assumption is that 
the examiner can better select for each pupil those tests 
which will reveal significant symptoms, for it often happens 
that some reaction on the part of the pupil will give the 
examiner a "lead" which it is highly desirable to follow up. 
Such rapid adaptations are manifestly impossible in group 
testing. Finally, some examiners hold that testing conditions
can be more carefully standardized by individual testing.
Early psychological investigators considered themselves
unusually virtuous when they took time to administer
all tests individually "with special care," as they said.

Group measurement is enormously economical in time. 
To administer a thirty-minute individual test to a group of 
500 pupils would, when all wastage is counted, take about 
300 hours of the examiner's time, whereas, under certain 
circumstances, a thirty-minute group test could be admin- 
istered to all pupils in about forty minutes. Even though 
the 500 pupils were tested in groups of only fifty, a great 
saving of time would be effected. It is this great expense 
in time that has delayed educational measurement in the 
kindergarten and primary grades. The economy of group 
testing is further illustrated by the psychological examina- 
tion of soldiers during the war. Several tests were given to 
many hundreds of thousands of soldiers. Each test could 
have been administered individually to each recruit. To 
have done so with the staff available would have required all 
the years of the war, when speed was imperative. Substan- 
tially the same situation confronts those who are introducing 
measurement into education. It is useless to attempt the 
measurement of millions of pupils with individual tests. To 
a very large extent educational measurement must be group 
measurement. 
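
The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the six minutes of wastage per pupil is an assumed figure, chosen so that the individual total agrees with the estimate of about 300 hours given above.

```python
def individual_test_hours(pupils, test_minutes, wastage_minutes):
    """Examiner hours needed to test every pupil one at a time.

    wastage_minutes is an assumed allowance per pupil for fetching,
    seating, and dismissing the child (not a figure from the text).
    """
    return pupils * (test_minutes + wastage_minutes) / 60


def group_test_hours(pupils, group_size, session_minutes):
    """Examiner hours needed when pupils are tested group_size at a time."""
    sessions = -(-pupils // group_size)  # ceiling division
    return sessions * session_minutes / 60


# 500 pupils, thirty-minute test, assumed 6 minutes of wastage each.
print(individual_test_hours(500, 30, 6))   # 300.0 hours

# The same pupils in one sitting of about forty minutes.
print(group_test_hours(500, 500, 40))

# The same pupils in groups of only fifty.
print(group_test_hours(500, 50, 40))
```

Even with groups of fifty, the ten sessions come to fewer than seven hours against some three hundred.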

Group testing may, under certain conditions, be fairer to 
the pupils tested. In experimentation it is often important 
to know the amount of change made by each pupil in a class 
during two weeks. It might take a single examiner a week 
to test every child by the individual method. The last 
pupils tested would thus have an extra week's advantage if 
learning were being measured, or a week's disadvantage if 
forgetting were being measured. Again, a test is often of 
such a nature that one pupil can partially prepare another. 
The first pupils tested can then spread information through 
the entire class or school. Finally, for some tests, it is 
especially difficult to standardize the personal equation of the 
examiner. Such a variable operates to the advantage of 
some pupils and to the disadvantage of others. Group testing
makes the personal equation more nearly constant for all
pupils within the group. 

What then is the conclusion of the whole matter? Indi- 
vidual testing and group testing each secure special values. 
The method adopted in the psychological examination of 
soldiers will probably come into common use in all educa- 
tional measurement whether done for purely pedagogical or 
clinical purposes. The initial tests given the soldiers were 
group tests. These revealed the illiterates and those who 
were in some way abnormal. The illiterate and abnormal 
groups were then intensively measured with individual tests. 
The diagnoses afforded by the group tests were accepted for 
the vast majority of the recruits. In time school psycholo- 
gists will not wait until abnormal cases are sent to them for 
diagnosis. They will sweep through the schools with a net 
of group tests and catch their own cases for intensive study. 
Even for the special cases, what with the development of 
group tests for illiterates, it is worth considering whether 
the greater number of group tests which may be given within 
an equal time-interval may not give a better diagnosis than 
the fewer individual tests. A good practical rule is to first 
give group tests, accept their diagnosis for most of the pupils,
and give further group or individual tests to the few pupils
who, according to the group tests, need special study.
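
This practical rule amounts to a two-stage procedure, which may be sketched as follows. The sketch is illustrative only: the function name, the pupil labels, and the two cutoff scores are assumptions made for the example, since the text does not say how the pupils needing special study are to be picked out.

```python
def two_stage_plan(group_scores, low_cutoff, high_cutoff):
    """Split pupils by their group-test scores.

    Pupils scoring inside [low_cutoff, high_cutoff] have the group
    diagnosis accepted; the rest are referred for further group or
    individual testing. Both cutoffs are hypothetical values.
    """
    accepted, referred = [], []
    for pupil, score in group_scores.items():
        if score < low_cutoff or score > high_cutoff:
            referred.append(pupil)
        else:
            accepted.append(pupil)
    return accepted, referred


# Hypothetical class of four pupils.
scores = {"A": 12, "B": 55, "C": 61, "D": 98}
accepted, referred = two_stage_plan(scores, low_cutoff=20, high_cutoff=95)
print(accepted)   # ['B', 'C']
print(referred)   # ['A', 'D']
```

Only the referred pupils cost the examiner any individual time.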

II. Guiding Principles in the Preparation of Instructions

1. Instructions Should Be as Brief as Is Consistent
with an Adequate Understanding of What Is to Be 
Done. — Besides consuming time, inordinately long in- 
structions tend to produce confusion in the minds of the 
pupils. Even adults find difficulty in following through com- 
plicated instructions. It has been demonstrated frequently 
that even among so intelligent a group as school teachers 
there are always a few who cannot follow very simple directions.
Long instructions so tax the memories of pupils that
absolute essentials are frequently forgotten. To forget a
single one of these essentials may markedly alter the child's
score. Brevity is frequently sacrificed to pure irrelevancies.
It is well to remember that the primary function of instruc- 
tions is to give a pupil adequate, but not necessarily com- 
plete, information about the test. Their primary function 
is not to give the pupil a general education. To quote a 
remark by Patterson, "Test! Don't teach!"

Again, the longer we make the instructions, the more we 
add to the confusion of inexperienced examiners. The 
novice is never quite sure of himself unless the instructions 
are sufficiently brief that his memory span can embrace not 
only every step of the process, but also the proper sequence 
of the steps. The untrained examiner cannot give his sole 
attention to instructions. He must maintain order among a 
roomful of naturally disorderly creatures, keep track of his 
watch, handle the test sheets, see that preceding instruc- 
tions are being followed and the like. It is a real kindness 
to both examiner and pupils to make instructions no longer 
than is necessary. 

But inadequate instructions are as bad as or worse than 
instructions which are too long. Inadequate instructions 
may wholly defeat the purpose of the test, or precipitate an 
avalanche of questions from the pupils. Instructions can- 
not be cut out of whole cloth. It requires both forethought 
and experimentation to produce instructions which will cause 
the pupils to do just what is wanted of them, and which will 
anticipate questions by the pupils. 

The omission of some points would be more disastrous 
than others. What the essential key points are depends, 
of course, upon the test. In the Thorndike Vocabulary 
Scale, for example, it is especially important that pupils be 
warned not to skip any words by accident. This is because 
the statistical method of computing scores for this test 
treats accidental omissions as though they were errors, and 
weights them very heavily. Below are a few quotations
from existing test instructions which are key points. "As
soon as you complete the first sheet, hold up your hand, and 
I'll give you a second one." "Read as rapidly as you can 
to still understand what it says." "Don't read anything 
over again." "You will have just one minute." "This is an 
addition test." "Check each sum before passing to the next 
example." "When I call 'stop,' draw a circle around the
last word read." "You will be asked to reproduce from 
memory what you have read." "Your score will be the 
number of examples you get right." "You will be marked 
on both speed and quality." "Write your name and grade." 
Some key points are so obvious that they will be recognized 
by anyone. Some are so subtle that only the intuitive or 
trained examiner can detect them. In sum, instructions 
should be as brief as possible, as adequate as is essential, 
and always consistent with the subsequent uses of results. 

2. Instructions Should Employ a Demonstration 
and Preliminary Test. — An ounce of demonstration is 
worth a pound of words! It takes more words to describe 
effectively what is to be done than it takes moves to show 
what is to be done. Anyone can try for himself an experi- 
ment to discover whether it is easier to show than to tell. 
Probably due to primordial practice, children, not to men- 
tion adults, can imitate better than they can comprehend 
and follow linguistic directions. To accompany description 
with a demonstration not only caters to pupils who may get 
impressions easier through the eye or through the ear, but, 
what is more important, it gives to all an impression through 
both eye and ear. Demonstration has the still further ad- 
vantage of securing better attention, especially from the 
young children. 

The demonstration may take any of several forms. In 
one test the examiner writes a sample test element on the 
blackboard and works it out for the pupils just as they are 
to work out similar tasks contained in the test. But in most 
tests which employ the demonstration method, sample test
elements correctly completed are printed on the test sheet.
Here is an example of instructions for a test accompanied
by such a completed sample.

"This is a test of common sense. Below are sixteen questions. Three
answers are given to each question. You are to look at the answers carefully;
then make a cross in the square before the best answer to each
question, as in the sample:

SAMPLE

Why do we use stoves? Because
[ ] they look well
[X] they keep us warm
[ ] they are black

"Here the second answer is the best one and is marked with a cross. Begin
with No. 1 and keep on until time is called."

Thorndike has devised a novel test. This test attains the 
maximum of showing and the minimum of linguistic directions.
So much is this the case that it may well be called a
pantomime test. The whole test can be given without the
reading or the speaking of a word by anyone. The test
was devised, in fact, to measure the intelligence of army recruits
who were illiterate Americans and immigrants who
did not even understand spoken English. The recruits were 
given a test sheet containing diagrams, pictures, etc. The 
examiner placed before the recruits an enlarged form of the 
test which was similar to, but not identical with, the test 
in the hands of the recruits. The examiner did the enlarged 
test with a heavy crayon. The examiner's movements 
showed the recruits what they were to do with their own test 
sheet. This is a most ingenious test, but, when there is a 
common medium of communication, the best method of giv- 
ing instructions is not by demonstration alone, nor by lin- 
guistic description alone, but by a happy combination of 
both. 

When instructions are at all complex, they should, as a 
rule, be accompanied by a preliminary test. Even though 
every possible precaution be taken to make all pupils understand
just what they are to do, one can never be quite sure
that all do understand unless a preliminary test is given. A
preliminary test has the additional advantage that pupils
can make most of their test adjustments before beginning
the test proper. Due to differences in nervousness, intelligence,
etc., some pupils adjust quickly and some slowly. If
there is no preliminary test, and if the time for the test is
relatively brief, the rate of adjustment may materially influence
the score, even though we are usually not primarily concerned
with the measurement of this factor. The preliminary
test should typify the nature of the test elements
proper.

This preliminary test may be presented in various ways. 
Sometimes the examiner writes one or more typical test elements
on the blackboard and the children do them more or
less in concert. Obviously this method does not give the 
examiner a sure guarantee that each pupil understands what 
is expected of him. 

A second method is to give each pupil an easy miniature 
test. The examiner can then go about the room and observe 
whether each pupil shows an understanding of instructions. 
The examiner can help any pupil do the first element or two 
if he does not understand. If this does not suffice, the pupil 
can be assumed to be incapable of doing the test at all. 
Such special preliminary tests have not come into general 
use because of the expense involved in printing extra test 
sheets and the time required for their distribution and col- 
lection. 

A third method is to print the preliminary test on the back 
of the regular test sheet along with the instructions or to 
reserve the front page of a booklet for instructions and pre- 
liminary test. This method is most satisfactory of all. Its 
use is not universal because of the greater expense involved 
in printing on both sides of a test sheet or making a booklet. 

A fourth method is a little less satisfactory and, as a com- 
pensation, less expensive. The instructions, demonstrations, 
and preliminary practice test can be printed on the same side
of the sheet as the regular test, but clearly separated from
the regular test. Pupils can be instructed to do the practice 
test, but not to begin the regular test until their work on 
the preliminary test has been inspected and they have re- 
ceived the signal to start the test proper. It is difficult to 
prevent a premature mental start. If the test is a rate test 
such a premature start may be a serious factor. 

A fifth method has been used. When the time element 
is not important, the elements of the preliminary test may be, 
so far as the pupil is informed, the first few elements of the 
regular test. After the test has been started the examiner 
can go about the room and give any needed help on the pre- 
liminary elements. In this case the preliminary elements 
will not be counted in determining the pupils' scores. 
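
The scoring rule of this fifth method can be sketched briefly. The function name and the sample marks below are hypothetical, and the sketch assumes the score is simply the number of regular elements answered correctly.

```python
def score_without_preliminary(responses, n_preliminary):
    """Count correct regular elements, ignoring the preliminary ones.

    responses: list of booleans in printed order (True = correct).
    n_preliminary: number of leading elements on which the examiner
    may have helped, and which therefore are not scored.
    """
    regular = responses[n_preliminary:]
    return sum(1 for correct in regular if correct)


# Hypothetical paper: the first two elements are preliminary.
marks = [True, False, True, True, False, True]
print(score_without_preliminary(marks, 2))  # 3
```

The pupil sees one continuous test, while the scorer discards whatever happened on the helped elements.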

Sometimes practically all the advantages of all the meth- 
ods can be secured by folding back the preliminary portion 
of the test in such a way as to conceal the regular test while 
the preliminary test is visible. This permits printing the
test by a single impression, and thus reduces expense. If 
expense is not, however, a consideration, the folder or book- 
let test, with the entire front page exclusively reserved for 
name, grade, age, instructions and preliminary test, is prefer- 
able. 

Finally, it remains to be pointed out that there are cases 
where the preliminary test cannot be used at all or its use 
must be carefully guarded. No one will, of course, make the 
mistake of using as a preliminary test element, an element 
which is identical with one in the regular test. To do this 
would be to teach the test. Again, there are instances where 
the purpose of the test is to discover whether the pupil pos- 
sesses the slightest trace of an ability. A preliminary test 
with the examiner's assistance, or even a demonstration 
might give sufficient instruction to defeat the purpose of the 
test, especially if the ability is one which can be quickly 
taught. This would be true for many elements of the Binet- 
Simon Test. The common sense of the examiner can be 
trusted to decide when a demonstration or preliminary test
of a given kind is pernicious. All that is necessary is to keep
in mind the fact that demonstrations and preliminary tests 
are not only clarifying but educating agencies. 

3. Instructions Should Be Adapted to and Uniform 
for All Who Are to Be Tested. — How much adaptation
is essential? In the testing of school abilities, the instructions
for the test should be so simple that all may understand
them. The instructions should be such that no child
will fail to make a score just because he failed to comprehend
the instructions. Fully to realize this aim, the instructions
should be no more difficult than the first or easiest test
element. If instructions are more difficult than the first test 
element, some pupil may make a test score of zero when he 
might at least have made a small score. In such a case the 
pupil has not been tested by the test but by the instructions. 

How much uniformity is essential? Instructions contain 
mechanical and non-mechanical features. The mechanical 
phase has to do with getting the pupil's name, sex, age, 
grade, etc. Uniformity is not necessary because the impor- 
tant thing is to get this data of identification, even though 
it is necessary for the examiner to so vary the procedure as 
to write the pupil's name for him. The mechanical features 
do not assist the pupil with the test proper. 

The non-mechanical features do determine to a certain 
extent, and frequently to a large extent, the score a pupil 
will make. It is far more convenient if these instructions 
are uniform from grade to grade. To cite one illustration, 
tests are frequently used in rural schools where several 
grades and many ages are grouped in one room. An exam- 
iner can test all these pupils at once if the instructions are 
uniform. Hence it is best for instructions to be both 
adapted to and uniform for all the pupils in all the grades. 

The intelligence examiner will grumble because I have 
not been even more enthusiastic for absolute uniformity. The 
intelligence examiner frequently has only a minor interest 
in knowing whether failure on the part of the pupil is due 
to lack of comprehension of the instructions or due to the
inability to do the test elements. His primary interest is
to find out whether the child possesses sufficient intelligence 
to deal with the total situation. And therefore the measurer 
of general intelligence may be right in contending that in- 
structions should be absolutely uniform for all ages. Other- 
wise the total situation would not remain constant. 

But it is unwise to carry over to pedagogical measurement 
a theory which is inapplicable. When an educator gives a 
vocabulary test, he is, as a rule, primarily interested to know 
what the pupil's vocabulary is, and only incidentally inter- 
ested to determine whether the pupil possesses sufficient 
general intelligence to understand the instructions or over- 
come the mechanical difficulties of the form of the test. If 
a teacher measures her pupils' ability to add, she wants to 
know how well her children can add. She is not then inter- 
ested in knowing how well they can understand her direc- 
tions, or read printed instructions. She wishes to reduce 
these irrelevancies to a minimum. Only in a test of reading
ability is it perfectly legitimate to make the instructions an 
integral part of the test itself. Nor is this primary interest 
peculiar to education. Many psychological tests which are 
designed primarily to measure intelligence prefer to measure 
it by means of the test material rather than by the instruc- 
tions. 

If the above distinction is sound it is legitimate to con- 
struct different instructions according to the age and ability 
of the pupils, provided whatever instructions are used give
in every grade an adequate understanding of what is to be 
done, which means that if sixth-grade instructions are more 
difficult than third-grade instructions, the former must still 
be easy enough for each sixth-grade pupil to understand 
what he is to do. In essence this means that in pedagogical 
measurement adaptation has priority over uniformity. My 
thesis required both adaptation and uniformity because I 
think it is possible to secure both at once. 

But it is frequently contended that there is no possibility 
of securing adequate adaptation together with uniformity.
It is claimed that the two characteristics are mutually antagonistic
and that we cannot have our cake and eat it too.
In recognition of this claim, Trabue in his Language Scales 
uses additional instructions and practice material for the 
younger children. 

It is held by some that words which are appropriate for 
third-grade pupils would insult eighth-grade pupils and 
words appropriate for eighth-grade pupils would be beyond 
the comprehension of younger pupils. It may easily be 
doubted that third-grade children appreciate "baby talk" 
as much as is claimed. Nor is it impossible to find words 
sufficiently simple that younger pupils will understand them 
and at the same time so dignified that older pupils will not 
resent them. 

When a test is being standardized for wide use through- 
out the country special care should be taken to see that in- 
structions can really be kept uniform and yet be universally 
adequate and universally just. In the first place instruc- 
tions should not require for their proper presentation 
material which some places may not have. Instructions 
should not require, for example, a blackboard unless a black- 
board is likely to be available wherever the test is to be 
given. Again, the instructions should employ neither words 
nor illustrations which have local significance only. When 
Woody could not find a universal term in use which meant 
an addition example, he secured universality by giving other 
terms in common use and suggested that examiners use the 
terms current in the grade or locality where the testing is 
being done. Again, an examiner once discovered that the 
standard instructions lacked sufficient universality because 
they failed to take into consideration the fact that some 
pupils are left-handed. Illustration of elements condition- 
ing universality could be multiplied. 

4. The Order of Instructions Should Be the Order 
of Doing. — It is probable that pupils can carry out instructions
with greater ease when the order of the instructions
is the order of doing. Long instructions are far more
tolerable when the steps in the description come in the same
order as the steps of the process the pupil must go through. 
The demonstration is easier to imitate when the pupil does 
not find it necessary to transpose, in the process of doing the 
test, the steps observed in the demonstration. With our 
present meager knowledge and experience it may be unwise 
to hold to this principle absolutely and invariably, but there 
can be little doubt as to its general applicability. The fol- 
lowing instructions for an aviation test which was still in 
the research stage when the armistice was signed, will illus- 
trate this order of doing: 

INSTRUCTIONS FOR AVIATION OBSERVERS' TEST

1. Seat the individuals to be tested in a well-lighted room and provide
each with two pencils. Present the Observation Practice Chart
directly in front of the subjects, on a level with, and fifteen feet
from their eyes.

2. Distribute an Observation Test Sheet to each candidate, and say:
Write your name at the top of the sheet.

3. When this is done, say:

Imagine you are an army observer, and that the accuracy of the
artillery fire will depend upon the accuracy of your observations.
Among other things your test sheet will ask you to locate certain
points, and to give the direction of certain points from certain other
points. Watch while I show you how this is done, and how you
are to record your answers.

Look at question 1 on your test sheet. Suppose you were asked
to locate J. You will note that according to the scale in the margin
the center of J is 28 points to the right, and 4 points down. So the
28 would be written under the word "Right" and the 4 under the
word "Down." Look at question 2. Suppose you were asked to
give the direction of 9 from J. Imagine J to be the center of a
circle which passes through 9. (Illustrate.) Consider that directly
above J is 0°, directly east of J is 90°, directly south of J is
180°, directly west of J is 270°, and so on back to 0°. It is clear
that the center of 9 is somewhere between 180° and 270°. It is in
fact at 244°. Remember that 0° is always at the north and that
the degrees increase clockwise. Do not make the mistake of recording
the degrees counter-clockwise. The other questions need no
explanation.

You will have 30 minutes, which allows less than a minute for
each question. If you finish before time is called go back and try
to improve your estimates.

4. Replace the Practice Chart with the Observation Test Chart; set the
watch at 9 hrs., 0 min., and 0 sec., and say: Begin!

5. At 9 hrs., 4 min., 0 sec., say: Even though you have not finished
question 10, begin now on question 11.

6. At 9 hrs., 11 min., 0 sec., say: Even though you have not finished
question 18, begin now on question 19.

7. At 9 hrs., 20 min., 0 sec., say: Even though you have not finished
question 27, begin now on question 28.

8. At 9 hrs., 30 min., 0 sec., say: Stop! Pencils down! and quickly
collect the test sheets.

9. Immediately after collecting the papers, say: Remember the importance
of complete and accurate observations. You will have one
minute to further study the chart, after which it will be covered,
and you will be asked to compare what you have seen with what
you see on another similar chart.

10. Set the watch at 9 hrs., 0 min., 0 sec., and say: Begin!

11. At 9 hrs., 1 min., 0 sec., remove the Observation Chart, distribute
Memory Comparison Test sheets, and say: Write your name at the
top of each sheet.

12. When names are written say: Follow the instructions at the top of
the test sheet. You will have twenty minutes. If you finish before
time is called go back and try to improve your judgments. Compare
what you remember of the chart you have been studying with
this new one.

13. At 9 hrs., 5 min., 0 sec., present the new chart and say: Begin!

14. At 9 hrs., 25 min., 0 sec., say: Stop! Pencils down! and quickly
collect papers.
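
The direction convention explained in step 3 (0° at the north, degrees increasing clockwise) can be checked with a short sketch. It assumes, as the instructions do not state, that the "Right" and "Down" scales use equal units; the function name and the second point in the last example are hypothetical and are not taken from the actual chart.

```python
import math


def bearing_degrees(from_right, from_down, to_right, to_down):
    """Direction of one chart point from another, in degrees.

    Coordinates follow the test sheet: points to the right and points
    down from the chart margin. 0 is north (straight up), and the
    degrees increase clockwise: 90 east, 180 south, 270 west.
    """
    east = to_right - from_right      # positive toward the right
    north = from_down - to_down       # positive toward the top
    angle = math.degrees(math.atan2(east, north))
    return angle % 360.0


# J is 28 points right and 4 points down, as in the instructions.
print(bearing_degrees(28, 4, 38, 4))   # a point due east of J
print(bearing_degrees(28, 4, 28, 14))  # a point due south of J

# A hypothetical point below and to the left of J falls between
# 180 and 270 degrees, as the instructions say of point 9.
print(round(bearing_degrees(28, 4, 18, 9), 1))
```

Passing the eastward offset as the first argument of `atan2` is what makes the angle run clockwise from north. The usual mathematical order would measure counter-clockwise from east, which is the mistake the instructions warn against.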

5. Instructions Should Be Broken into Action 
Units. — The strain upon the pupil's memory is not
nearly so great when the instructions are broken into action 
units. Wherever possible the pupil should carry out direc- 
tions before any other directions are given. The set of in- 
structions just given illustrates instructions which are broken 
into action units. The instructions which follow are not 
broken into such units. 

The experimenter holds the sheet before the class and says: "This sheet
contains some incomplete sentences, which form a scale. This scale is to 
measure how carefully and rapidly you can think and especially how good 
you are in your language work. 

"You are to write one word on each blank, in each case selecting the 
word which makes the most sensible statement. 

"You may have thirty minutes in which to sign your name at the top 
of the page and write the words that are missing. The papers will be 
passed to you face downward. Do not turn them over until we are all 
ready. After the signal is given to start, remember that you are to write 
just one word on each blank and that your score depends on the number 
of perfect sentences you have at the end of thirty minutes." 

It is easy to imagine just how little a pupil would remem- 
ber of the key points in the latter set of instructions after 
the excitement of passing papers, writing names and the like. 
When the order of instructions is the order of doing, and 
when the instructions are properly segmented by action, the
instructions intimately concerned with each step of what
the pupil is to do immediately precede that step. The pupil 
can give his undivided attention to that particular bit of 
instruction. When this principle is not satisfied the pupil is 
trying to grasp what is coming next and at the same time 
trying frantically to hold on lest what he has already heard 
escapes. 

6. Instructions Should Equalize Interest. — There 
are numerous factors besides interest which condition ability. 
Interest is dignified with special consideration because of its 
large effect upon the pupil's score. Interest determines 
effort. A pupil with high ability may show a range of in- 
terest from zero to high intensity, and hence a similar range 
of effort. 

Shall standardization be upon a high plane of interest or 
upon a low plane? And how shall the desired stratification 
be secured? Experimental results have not yet shown 
whether it is easier to equate interest on a low plane, medium 
plane or high plane. Hence general common-sense experi- 
ence must decide. Practical considerations rule out the 
offering of rewards high enough to secure the intensest pos- 
sible interest. Normal life interests vary so greatly that they 
cannot be taken as a criterion. The fact that tests are not 
so educative when taken with low interest as when taken 
with high interest tends to rule out attempting an equaliza- 
tion of interest on a low plane. Furthermore, performance 
on one test does not seem to agree so well with performance 
upon a duplicate test when interest is on a low plane. In 
the absence of reliable evidence, the best guess is that per- 
formance is more constant and is a better index of the ability 
being measured when interest is at the maximum attainable
by practicable methods. 

What motivation can be legitimately employed? Unless 
such will defeat the object of the test, the pupil should be 
informed of the general purpose of the test and when it is 
not perfectly obvious, of the general method by which he 
is to be scored. A pupil will be more interested who is told
that the purpose of the test is to discover how rapidly he
can read and then how accurately he can answer from 
memory questions upon what he has read, and hence his 
score will depend upon the number of seconds required to 
read a passage and the number of questions he can answer 
correctly upon what he has read. A detailed discussion of 
the purposes and methods of the test should not be attempted 
because of the necessity for brevity, and sometimes because 
of a necessity for concealing from the child the exact method 
of scoring. Secrecy is occasionally necessary in experi- 
mental work and in cases where the score is at the mercy of 
the pupil's honesty or lack of honesty. 

The behavior of the child and the testimony of adults 
bear eloquent witness to the potency of rivalry as a begetter 
of interest. Probably no stimulus at the disposal of the 
school is so powerful, natural and generally healthful. 

It is scarcely necessary to point out, however, that it will 
soon become impossible to secure interest through any 
method unless the pupils have an opportunity to learn how 
well they did on the test. 

Project testing offers another method of not only securing 
interest but of securing reality and naturalness as well, for 
here the test itself becomes a motive. In a recent study of 
the Hope Farm School, N. Y., C. C. Certain used the 
project method of testing composition and penmanship. 
The Hope Farm pupils were requested to write a letter to 
the Horace Mann School pupils. Certain had made arrange- 
ments for the delivery of these letters and for returning 
replies from the Horace Mann pupils. Each Hope Farm 
pupil was provided with paper, envelope and the name of a 
boy and girl in the Horace Mann. Project testing is excel- 
lent where it is possible to give the time and take the trouble 
necessary to make it a success. 

All the implications of the preceding paragraph have been 
that the device of securing interest by means of some form 
of rivalry is more artificial than project testing. We cannot 
be sure of this. Most of the games voluntarily selected by
children and adults would never be selected for their own
sake. With children as well as adults rivalry is itself a 
project and is intrinsically satisfying. Remove the contest 
feature and how long would men and women lay card on 
card, or men punch ivory balls into holes with a long slender 
stick, or would war even remain the engaging pursuit that 
it is and the greatest game of all? Interest through projects 
is excellent, but interest through rivalry is not always arti- 
ficial. 

7. Instructions to Pupils Should Be Accompanied 
by Instructions to Examiners. — Instructions to pupils
should be accompanied by instructions to examiners telling 
how the test is to be applied, because it is a question which 
needs instructions more, pupil or examiner. Instructions to 
the examiner should be in steps easy to comprehend and 
follow. This easy use can be facilitated in two ways. First, 
the author of the instructions should formulate for the ex- 
aminer the exact words to say to the pupils and insert 
between various units of directions to pupils, the necessary 
directions to the examiner. And, secondly, when the instruc- 
tions to the examiner are inserted among instructions to 
pupils, the latter should be set off from the former in some 
convenient fashion. This can be done by numbering, para- 
graphing, underscoring, or italicizing the words to be said 
to the pupils. The sample set of broken instructions, given 
a few pages back, illustrates both the exact wording for 
pupils and a method of separating instructions to pupils 
from instructions to examiners. 



CHAPTER IX 
SCALING THE TEST 

I. Percentile Scale — Percentile Unit 

Why Scale Tests? — The fundamental aim of all test- 
ing is to reveal correct differences between pupils or groups 
of pupils. To reveal correct differences, a test must not
only be valid but must possess, among others, the following
characteristics. (Crude or exact scaling is prerequisite to
each of the following traits.)

1. Every pupil should make some score larger than
zero. If every pupil makes a zero score it is utterly impos- 
sible to tell which is the best, average, and stupidest pupil. 
If only one pupil out of a class makes zero, there is no way 
to determine just how much more stupid he is than the rest 
of the pupils. Zero-score pupils are unmeasured. The 
range of ability in a class is usually very great so, if the 
least able pupils are to make a score, the first elements of 
all difficulty tests must be within the ability of the least able 
and hence far easier than would be required for the abler 
pupils. The criterion requires that rate tests also must be 
composed of test elements whose difficulty is within the 
ability of all the pupils and which give a sufficiently long 
time limit. In an initial test in a recent Ph.D. research 
several pupils in one of the experimental groups made zero 
scores. In the final test, some time after, they made scores 
above zero. The conditions of the research required that 
the amount of their improvement be known. How much 
did the pupils improve? Nobody knows. It may have been 
and probably was a small amount, but it may have been 
enormous. 


2. No pupil should make a perfect score. Perfect-score 
pupils are unmeasured just as zero-pupils are unmeasured. 
In the case of perfect scores it is not known how much 
better the pupils are and in the case of zero scores it is not 
known how much worse they are. 

3. There should be no undistributed scores, whatever. 
A test often yields undistributed scores when there is not 
a single zero or perfect score, and these may occur any- 
where between the lowest and highest scores inclusive. 
These undistributed scores are produced by coarse scoring. 
The coarsest possible method of scoring is the "all or none"
method. To score pupils on a test as either "passed" or
"failed" is an example of the "all or none" method, and
gives very undistributed scores, for so far as the scores indi- 
cate all who receive a pass are exactly alike. 

How fine should the scoring for a test be? The fineness 
of the scoring depends upon the uses to be made of the 
results. The following, however, will serve as a rough 
general rule: Construct tests which will separate the pupils 
into at least seven groups of ability, and not less than thir- 
teen if the data are to be used for correlation. The above 
numbers, seven and thirteen, are minimum numbers. The 
finer the grouping the better. If the pupils are separated 
into less than seven groups of ability the results will have 
very limited uses, and if less than thirteen the influence of 
coarse scoring upon the coefficient of correlation will not be 
negligible. Among difficulty tests that one provides best 
against any sort of undistributed scores where the first test 
element is easy enough for all the pupils and where each suc- 
ceeding element progressively increases in difficulty by 
small steps to a point beyond the ability of the ablest pupil. 
A very fine scoring of a few test elements will, however, 
produce the same effect as increasing the number of test 
elements. Stone, for example, has but twelve problems in 
his Reasoning Test in Arithmetic, yet it is possible to sep- 
arate pupils into more than twelve groups by means of his 
test by giving credit for only parts of problems correct. 




Each can easily figure out for himself just how to avoid 
undistributed scores in rate tests. 

In the foregoing discussion of undistributed scores it has 
been assumed that each examiner will desire a score for each 
pupil, for unless such scores are secured a test cannot serve 
its most vital functions. In case only a class score is desired 
a few undistributed zero and perfect scores would do little 
or no harm if the median of pupil scores or the per cent of 
pupils doing a certain test element be the method of com- 
puting class scores. If, on the other hand, the mean of 
pupil scores is taken as the class score, undistributed
extremes may seriously affect the size of the class score.

4. The test should be scaled and the standardized method 
of scoring should utilize these refinements of the scaling. 
The scaling is a little more useful if the scale distance from 
one test element to the next is exactly equal to the scale 
distance between any two adjoining elements, i. e., if the 
scale progresses by equal steps or units of difficulty. The ex- 
actness of the scaling conditions the exactness with which 
differences between pupils can be measured. 

5. A corollary of the preceding paragraph is that a test
should yield a statistical result. All measurement in descrip- 
tive words should give place to mathematical statement. 
Supervisors, for example, frequently rate teachers without 
developing any statistical system of recording and combining 
their ratings. It is mainly in the realm of subjective esti- 
mates that non-statistical measuring occurs. Recently an 
experiment was undertaken to determine by means of stand- 
ard tests just how accurately supervisors could estimate the 
efficiency of certain teaching methods. When the time came 
to compare test results with the judgment of the supervisors, 
no worthwhile computations could be made, for the super- 
visors had not kept any statistical records. 

6. Finally, correct differences cannot be revealed unless 
the two scores yielded by each rate test be reducible to a 
common denominator. Consider this situation from the 
Courtis Addition Test. Pupil A makes a speed score of 10
and an accuracy score of 90% while Pupil B makes a speed
score of 12 and an accuracy score of 75%. Which pupil 
has made the better showing? As long as speed or accuracy 
is left free to fluctuate up and down in a sort of see-saw man- 
ner, no satisfactory comparisons between scores can be made 
until a table has been prepared for transmuting all scores to
a constant speed or constant accuracy. Such a proposed 
table would tell us what would have been the accuracy of 
Pupil B had he worked at a speed of 10 examples instead 
of 12. Courtis ¹ has originated a formula whereby speed
and accuracy scores on his Arithmetic Tests, Series B, can 
be converted into a single score. 

Perhaps the quickest method of determining the accuracy 
equivalence of a given amount of speed would be to adjust 
the weighting assigned until there has been secured the 
highest obtainable self-correlation between scores from two 
applications of the same rate test to the same pupils, when 
the scores correlated represent a combination score for 
both speed and accuracy. 
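
The weighting search suggested above might be sketched as follows. This is only an illustration under stated assumptions: the combination rule (speed plus a weight times per cent accuracy), the grid of candidate weights, the function names, and the scores of the six pupils are all hypothetical and are not taken from the Courtis tests or the Courtis formula.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def combined_scores(speed, accuracy, weight):
    """Assumed combination: speed plus weight times per cent accuracy."""
    return [s + weight * a for s, a in zip(speed, accuracy)]


def best_weight(first, second, weights):
    """Return (self-correlation, weight) for the weight giving the highest r."""
    trials = []
    for w in weights:
        r = pearson(combined_scores(*first, w), combined_scores(*second, w))
        trials.append((r, w))
    return max(trials)


# Hypothetical speed and accuracy scores for six pupils on two applications.
first_application = ([10, 12, 8, 15, 9, 11], [90, 75, 95, 60, 85, 80])
second_application = ([11, 10, 9, 13, 10, 12], [85, 88, 92, 70, 80, 78])
candidate_weights = [i / 20 for i in range(21)]   # 0.00, 0.05, ... 1.00

best_r, chosen_weight = best_weight(
    first_application, second_application, candidate_weights
)
print(round(best_r, 3), chosen_weight)
```

With real data the two applications would be given to the same pupils, and a finer or wider grid of weights could be tried; the weight finally chosen states how many units of speed one per cent of accuracy is treated as worth.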

There are numerous methods of scaling tests. First, there 
is the goal scale used by Courtis in connection with the 
Courtis Supervisory Tests. Any pupil whose score on the 
test falls between, say, 20 and 25 words spelled correctly on 
a particular spelling test is considered to have attained an 
appropriate spelling goal and is scored 1000. Any pupil 
who falls between, say, 17 and 20 is scored 500 and so on
down to zero. Second, there is the frequency-of-occurrence
scale. In the Jones' Vocabulary Test the score a pupil re-
ceives for knowing a certain word depends upon the fre-
quency of that word's appearance in ten primers. In similar
manner the degree of an individual's emotional aberration, as
determined by the Kent-Rosanoff Free Association Test, is 
measured by the rarity of the individual's responses to the 
test. Then there is the percentile scale, age scale, grade 
scale, product scale, and T-scale. The last five are the ones
more commonly used. These five only will be discussed in
detail.

¹ S. A. Courtis and E. L. Thorndike, "Correction Formulae for Addition Tests";
Teachers College Record, January, 1920.

How to Construct a Percentile Scale. — Table 17 
shows in the first column the number of questions in the 
Thorndike-McCall Reading Scale, Form 1. In the second
column is shown the number of eleven-year-old pupils 
answering correctly a given number of questions. In the 
third column is shown the percentile corresponding to a 
given number of questions correct in the first column. The 
first and third columns constitute a percentile table for the 
eleven-year-olds for this particular reading scale. In similar 
fashion a percentile table could be prepared for this scale for 
any age or for any scale for any age, or, if desired, for this 
or any scale for various ages combined. 

How was this percentile table constructed? In a later
chapter will be found a detailed description of the computa-
tion of Q1 (25 percentile), Q2 (50 percentile or median),
and Q3 (75 percentile). All other percentiles are computed
in similar manner. When scores are arranged in order of
size and when counting is done in the direction of low to
high, the 25 percentile is that score which is found by count-
ing through one-fourth of the scores. The 50 percentile is
found by counting through one-half of the scores. The 75
percentile is found by counting through three-fourths of the
scores. Similarly the 10 percentile is found by counting
through one-tenth of the scores, the 20 percentile by count-
ing through two-tenths of the scores, the 35 percentile by
counting through three-and-one-half tenths of the scores,
and similarly for any other percentile. The lowest and
highest scores may be considered the zero and 100th per-
centiles respectively.
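
The counting rule just described can be sketched in a few lines of Python. The function name `percentile_by_counting` and the twenty sample scores are assumptions made for illustration; they are not the data of Table 17. The sketch simply counts upward through the ordered scores and leaves aside any refinement treated in the later chapter.

```python
def percentile_by_counting(scores, p):
    """Return the score reached after counting through p per cent of the scores.

    Scores are arranged from low to high. p is a whole number from 0 to 100.
    The lowest score is treated as the zero percentile.
    """
    ordered = sorted(scores)
    if p <= 0:
        return ordered[0]
    # Number of scores to count through, rounded up to a whole score.
    count = (p * len(ordered) + 99) // 100
    return ordered[min(count, len(ordered)) - 1]


# Hypothetical scores for twenty pupils (an assumption for illustration).
sample_scores = [6, 7, 7, 9, 10, 11, 12, 14, 14, 15,
                 16, 17, 17, 18, 20, 21, 22, 24, 27, 33]

q1 = percentile_by_counting(sample_scores, 25)   # one-fourth of 20 = 5 scores
q2 = percentile_by_counting(sample_scores, 50)   # one-half of 20 = 10 scores
q3 = percentile_by_counting(sample_scores, 75)   # three-fourths of 20 = 15 scores
print(q1, q2, q3)
```

Calling the function for each of 10, 20, ... 100 in turn would yield the third column of a percentile table for these twenty scores.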

Such is the procedure for counting percentiles directly. 
Percentiles may also be computed through the use of a table. 
When computed in the latter fashion, the percentiles do not 
so much represent what was actually found but what pre- 
sumably would have been found had a very, very much 
larger number of pupils been tested. 



TABLE 17

Shows the Construction of a Percentile Table for Eleven-Year-Old
Children for the Thorndike-McCall Reading Scale, Form 1

Questions Correct    Number of Pupils    Percentile
        1
        2
        3
        4
        5
        6                    1
        7                    5
        8                    4
        9                    2
       10                    6
       11                    4
       12                    3
       13                    4
       14                   12                10
       15                   15
       16                   22
       17                   31                20
       18                   20                30
       19                   32
       20                   42                40
       21                   35                50
       22                   40                60
       23                   32                70
       24                   29                80
       25                   22
       26                   16
       27                   16                90
       28                   13
       29                    3
       30                    4
       31                    6
       32
       33                    1               100



How to Interpret Percentile Scores. — Table 18 shows
how to use percentile tables to interpret the scores of an
eleven-year-old pupil. Table 18 gives a percentile table for
eleven-year-old pupils for each of three different tests. Test
A is the Thorndike-McCall Reading Scale, Form 1, and the
percentiles are taken from Table 17. Test B is an arith-
metic test. Test C is a spelling test. The pupil made a
score of 22 on Test A which is equivalent to a percentile of
60. He made a score of 39 on Test B which gives him an
arithmetic percentile of 55, because a score of 39 is half
way between 38 and 40 whose percentiles are 50 and 60.
His spelling percentile is as shown, 50. Since a score of ten
corresponds to three different percentiles, 40, 50, and 60, it
is best to take the middle one. Had the pupil been ten
years old instead of eleven, the percentile tables for ten-
year-olds should be used in place of the percentile tables
for eleven-year-olds.

TABLE 18

Shows How to Interpret the Scores on Three Tests for an Eleven-Year-Old
Pupil by Means of Percentiles

                      PERCENTILE TABLES                      PUPIL'S   PUPIL'S
TEST          10  20  30  40  50  60  70  80  90  100        SCORES    PERCENTILES
Test A        14  17  18  20  21  22  23  24  27   33          22         60
Test B     1  25  30  34  36  38  40  43  46  50   62          39         55
Test C     4   6   7   9  10  10  10  11  13  17   24          10         50
MEDIAN                                                                    55



The pupil's true percentile position on the three tests may, 
for most practical purposes, be taken as the median of his 
three percentiles, namely, 55. This percentile shows him to 
be slightly above the average for his age. 
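
The reading of Table 18 can be sketched as follows, using the scores tabled for percentiles 10 through 100. The function name and the handling of special cases are assumptions: a score found in the table takes the middle of its percentiles, and a score between two tabled scores is placed proportionally between their percentiles, as in the worked example for Test B.

```python
from statistics import median

PERCENTILES = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Scores at each percentile for eleven-year-olds, as given in Table 18.
NORMS = {
    "A": [14, 17, 18, 20, 21, 22, 23, 24, 27, 33],
    "B": [25, 30, 34, 36, 38, 40, 43, 46, 50, 62],
    "C": [6, 7, 9, 10, 10, 10, 11, 13, 17, 24],
}


def score_to_percentile(score, norms):
    """Convert a test score to a percentile using one row of norms."""
    # A score listed in the table: take the middle of its percentiles.
    matches = [p for p, s in zip(PERCENTILES, norms) if s == score]
    if matches:
        return median(matches)
    # Otherwise interpolate between the two neighbouring tabled scores.
    pairs = list(zip(PERCENTILES, norms))
    for (p_low, s_low), (p_high, s_high) in zip(pairs, pairs[1:]):
        if s_low < score < s_high:
            fraction = (score - s_low) / (s_high - s_low)
            return p_low + fraction * (p_high - p_low)
    raise ValueError("score lies outside the tabled range")


pupil_scores = {"A": 22, "B": 39, "C": 10}
pupil_percentiles = {
    test: score_to_percentile(score, NORMS[test])
    for test, score in pupil_scores.items()
}
overall = median(pupil_percentiles.values())
print(pupil_percentiles, overall)
```

When a score matches an even number of tabled percentiles, `median` returns the point midway between the two middle ones; the text does not cover that case, so this choice is an assumption of the sketch.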

Theoretically, however, a pupil's mental index is not truly
represented, when two or more tests are used, by the median
or mean of his various percentiles on these tests. Pintner ²
used the percentile-scale method for his mental survey tests.
In an admirable discussion of percentiles he points out that
an additional step is necessary. Due to the fact that in-
creased adequacy of measurement decreases variability it is
necessary to have a super-percentile table which shows the
true percentile value of various median percentiles. This
super-percentile table is constructed just like a percentile
table for a single test. The median percentiles for a large
number of pupils of a given age take the place in the first
column of Table 17 of the number of questions correct.

² Rudolf Pintner, The Mental Survey; D. Appleton & Co., New York, 1918.

II. Age Scale — Growth Unit

How to Construct an Age Scale. — The construction 
of an age scale merely requires the determination of satis- 
factory age norms. The percentile units tell just how a 
pupil compares with all the other pupils of like age, if age is 
the basis, or like grade, if grade is the basis. Instead of 
interpreting a score for a ten-year-old pupil as the 25, 50 
or 75 percentile of ten-year-olds, we could say that this 
ten-year-old pupil has a score equal to the average score for 
eleven-year-olds, or twelve-year-olds or any age above or 
below the age of the pupil in question. We could score 
the pupil 8, 10, 11.5, etc., meaning that he was respectively
the equal of average eight-year-olds, average ten-year-olds, 
or half way between average eleven-year-olds and average 
twelve-year-olds. To do this would be to use what I have 
called the growth unit. Given a norm for each age, any 
pupil's test score may be readily transmuted into educa- 
tional age and Educational Quotient (E.Q.) if the test is 
an educational test, or into mental age and Intelligence 
Quotient if the test is an intelligence test. Since the process 
is identical in both cases the transmutation is illustrated in 
Table 19 only for some educational tests. 

TABLE 19

Shows the Use of Growth Unit and Educational Quotient (E.Q.)
for Interpreting Educational Test Scores of a 10-Year-Old Pupil

                             Age                    Pupil's   Pupil's   Pupil's
Test               8    9   10   11   12   13       Test      Age       E.Q.
                                                    Score     Score
A — Av. Score      4    8   12   15   18   20        12       10        100
B — Av. Score     20   30   38   45   50   55        46       11.2      112
C — Av. Score     85   84   82   80   78   75        83        9.5       95
Median                                                        10        100

How to Interpret Age Scores. — Table 19 is inter-
preted viz.: On Test A the average score is 4 for 8-year-
olds, 8 for 9-year-olds and so on. The pupil in question
made a score of 12. Since 12 is exactly the average score
for 10-year-olds, the pupil's age score is 10. And since the
pupil's chronological age is also 10 years, his Educational
Quotient (E.Q.) is 100, computed thus: [(10 ÷ 10) ×
(100)]. An E.Q. of 100 means that the pupil is at age or
exactly normal in the ability measured by Test A. The
pupil's test score on Test B is 46, which is one-fifth or .2 of
the distance from norm 45 to norm 50, the norms for ages
11 and 12 respectively. Consequently the
pupil's age score is 11.2, and his E.Q. is 112. Test C is
scored in time units, hence the larger the score is, the less
the ability is. The pupil is one-half or .5 of the way be-
tween average scores 84 and 82 which are the norms for
ages 9 and 10 respectively. Hence the pupil's age score is
9.5 and his E.Q. is 95. The medians 10 and 100 tell us
that in the median of all tests the pupil is a normal indi-
vidual. Were this pupil 10 years and 6 months old chrono-
logically, or 10.5 years old, his E.Q. for Test A would be
95.2 computed thus: [(10 ÷ 10.5) × (100)], his E.Q. for
Test B would be 106.7 computed thus: [(11.2 ÷ 10.5) ×
(100)], his E.Q. for Test C would be 90.5 computed thus:
[(9.5 ÷ 10.5) × (100)]. Sometimes it will be more con-
venient to reduce the age in years and decimals of years
to months before computing E.Q. Thus the last E.Q. of
90.5 may be computed thus: 9.5 yrs. = 114 mos., 10.5
yrs. = 126 mos. [(114 ÷ 126) × (100)] = 90.5. The
result is the same either way.
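
The transmutation of Table 19 can be sketched as below. The norms are those of Table 19; the function names `age_score` and `educational_quotient` are assumptions of the sketch. Interpolation runs between adjoining age norms in whichever direction the norms move, so the time-scored Test C needs no special handling.

```python
from statistics import median

AGES = [8, 9, 10, 11, 12, 13]

# Average score at each age, as given in Table 19.
AGE_NORMS = {
    "A": [4, 8, 12, 15, 18, 20],
    "B": [20, 30, 38, 45, 50, 55],
    "C": [85, 84, 82, 80, 78, 75],   # time units: a larger score means less ability
}


def age_score(score, norms):
    """Place a test score between the two adjoining age norms that bracket it."""
    pairs = list(zip(AGES, norms))
    for (age_low, n_low), (age_high, n_high) in zip(pairs, pairs[1:]):
        if min(n_low, n_high) <= score <= max(n_low, n_high):
            fraction = (score - n_low) / (n_high - n_low)
            return age_low + fraction * (age_high - age_low)
    raise ValueError("score lies outside the range of the age norms")


def educational_quotient(age, chronological_age):
    """E.Q. = (age score / chronological age) x 100."""
    return 100 * age / chronological_age


pupil_scores = {"A": 12, "B": 46, "C": 83}
age_scores = {t: age_score(s, AGE_NORMS[t]) for t, s in pupil_scores.items()}
quotients = {t: educational_quotient(a, 10) for t, a in age_scores.items()}

print(age_scores, median(age_scores.values()))
print(quotients, median(quotients.values()))
```

Passing 10.5 as the chronological age reproduces the second set of quotients in the paragraph above, subject to rounding in the last figure.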

There are certain age scales which do not require the 
transmutation illustrated in Table 19. The Stanford Re-
vision of the Binet-Simon Intelligence Scale is an illustration
of such a scale; whereas the Buckingham-Monroe Illinois 
Intelligence Examination is an illustration of one which 
does require transmutation. The Stanford Revision of the 
Binet-Simon Scale is so constructed that two months of 
age is allowed for every test element done correctly. Hence 
the original score comes out as the age score. 

III. Grade Scale — Grade Variability Unit 

Derivation of Grade Scale. — The scoring unit for 
time is the hour, for wealth is the dollar, for weight is the 
pound, for temperature is the degree, for distance is the 
foot and so on. The fundamental and most universal scor- 
ing unit for measuring educational achievement is the P.E., 
S.D. or some other measure of the variability of pupil per- 
formance. The nature of this scoring unit is discussed in a 
later chapter. At this point it will be necessary only to 
consider how this unit is utilized in scale construction. The 
technique of grade scale construction is described in detail 
in Woody's The Measurement of Some Achievements in 
Arithmetic, Bureau of Publication, Teachers College, N. 
Y. C. A brief summary of the steps involved in this method 
of scale construction will be sufficient for our purpose. To 
date, this method has been applied to grades rather than 
ages. 

Suppose an examiner wishes to make an addition scale for 
Grade III. The steps are, viz.: 

1. He uses his judgment to select a large number of 
addition examples, gradually varying in difficulty from the 
easiest possible example to very difficult examples. 

2. He tries out these examples upon a large number of 
chosen-at-random, third-grade pupils. 

3. He computes the per cent of pupils correctly solving 
each example. If 90% solve a given example, obviously this 
example is very easy and hence has a low scale value. If 
50% solve a given example, that example is of average
difficulty and occupies a medium high position on the scale. If
only 5% solve a given example, this example is very difficult
and occupies a high position on the scale. Thus an ex- 
ample's position on the addition scale is determined by the 
per cent of pupils correctly solving it. The larger the per 
cent is, the less the difficulty, and the lower the example's 
position on the difficulty scale. 

4. By means of Table 20, he converts these per cents 
into P.E. units of difficulty or P.E. distances above and 
below the third-grade median. 



TABLE 20

Showing the P.E. Values above or below the Median Corresponding to the Per
Cent of Pupils Correctly Doing a Given Test Element. Before Reading, Sub-
tract the Per Cent Correct from 50%. Examples: 83.60% did test element A.
∴ 50% - 83.60% = -33.60% = 1.45 P.E. below the median. 2.15% did test
element B. 50% - 2.15% = 47.85% = 3.0 P.E. above the median.

P.E.   .00    .05  | P.E.   .00    .05  | P.E.   .00    .05  | P.E.   .00       .05
 .0   0000   0135  | 1.5   3441   3521  | 3.0   4785   4802  | 4.5   4988      4989
 .1   0269   0403  | 1.6   3597   3671  | 3.1   4817   4831  | 4.6   4990      4991
 .2   0536   0670  | 1.7   3742   3811  | 3.2   4845   4858  | 4.7   4992      4993
 .3   0802   0933  | 1.8   3896   3939  | 3.3   4870   4881  | 4.8   4994      4994.6
 .4   1063   1193  | 1.9   4000   4057  | 3.4   4891   4900  | 4.9   4995.2    4995.7
 .5   1321   1447  | 2.0   4113   4166  | 3.5   4909   4917  | 5.0   4996.2    4996.6
 .6   1571   1695  | 2.1   4217   4265  | 3.6   4924   4931  | 5.1   4997.1    4997.4
 .7   1816   1935  | 2.2   4311   4354  | 3.7   4937   4943  | 5.2   4997.7    4998.0
 .8   2053   2168  | 2.3   4396   4435  | 3.8   4948   4953  | 5.3   4998.2    4998.4
 .9   2291   2392  | 2.4   4472   4508  | 3.9   4957   4961  | 5.4   4998.6    4998.8
1.0   2500   2606  | 2.5   4541   4573  | 4.0   4965   4968  | 5.5   4999.0    4999.1
1.1   2709   2810  | 2.6   4602   4631  | 4.1   4971   4974  | 5.6   4999.2    4999.3
1.2   2908   3004  | 2.7   4657   4682  | 4.2   4977   4979  | 5.7   4999.4    4999.5
1.3   3097   3188  | 2.8   4705   4727  | 4.3   4981   4983  | 5.8   4999.55   4999.6
1.4   3275   3360  | 2.9   4748   4767  | 4.4   4985   4987  | 5.9   4999.65   4999.7



Suppose the per cent of pupils working correctly examples 
A through G were as in Table 21. Table 21 shows how to 
use Table 20 for converting per cents correct into P.E. dis-
tances above and below the median of Grade III.



TABLE 21

Showing the Use of Table 20 for Converting Per Cents Correct Into P.E.
Distances from the Median

Examples                       A       B       C      D      E      F      G
Per cent correct              95.     85.     60.    50.    40.    12.2    2.2
Subtract from 50 per cent    -45.    -35.    -10.    00.    10.    37.8   47.8
P.E. Distances               -2.45   -1.55   -0.4    00.    0.4    1.75    3.0

The difference in difficulty between Example A and 
Example B is (95 - 85) or 10% which is (2.45 - 1.55) or
.90 P.E. The difference between C and D is also 10%, 
but the P.E. difference is only .4 or less than half the differ- 
ence between A and B. The difference between D and E 
is also 10% and the P.E. difference is .4. The difference 
between F and G is likewise 10% and the P.E. difference is 
1.25. Obviously, differences in per cent tell us almost noth- 
ing about the differences in difficulty. The real difference in 
difficulty is shown not by the per cents but by the P.E. 
values. The C D and D E differences are equal both in per 
cents and P.E.'s. This is because D is at the median while
C and E are at like positions below and above D. The F G 
difference is the largest because these are the most extreme 
per cents. Example A has the largest per cent correct and 
yet it is farthest below the median. This is because Table 
21 shows a scale of difficulty. That example which the 
largest per cent of pupils work correctly is the easiest 
example. 

5. He determines the P.E. distance of the zero point of 
addition ability from the third-grade median. In locating
the zero point, the point of view changes. We no longer 
enquire what per cent of pupils worked any one example 
but what per cent of the third-grade pupils did not succeed 
in working a single example. Suppose that 4% failed to 
work a single example. According to Table 20 this is 2.6 
P.E. below the third-grade median and is the zero point. 

6. He determines how many P.E. each example is above 
the zero point, and the scale is finished. The P.E. value of 
each example above zero is shown below, and was com- 
puted by algebraically subtracting the distance of the zero 
point below the median, namely, -2.6 P.E., from the P.E.
value of each example as shown in Table 21.

Examples                   A     B      C     D     E     F      G
P.E. Value above Zero     .15   1.05   2.2   2.6   3.0   4.35   5.6

7. Sometimes he eliminates from the scale those ex-
amples which do not fall at equal P.E. intervals. Woody's
Arithmetic Scales, Series B, represent such a selection from
his longer Series A and give a scale progressing by approxi-
mately equal steps.

8. If the addition scale is being constructed for the 
entire elementary school instead of for only one grade, he 
repeats steps 2, 3 and 4 for each elementary school grade as 
has just been done for Grade III. 

9. He computes the P.E. distance from each grade 
median to the adjoining grade median or medians. This 
is done by computing the per cent of pupils in one grade 
having scores which are larger than the median of the 
adjoining grade. This per cent, when read in Table 20, 
shows the P.E. distance between the medians of the two
grades. The two possible direct and several possible indi- 
rect determinations are averaged to get a truer P.E. distance. 

10. By the use of these intervals, he converts the P.E. 
distance of each example from its own grade median into 
P.E. distances above the common zero point of reference. 

11. He then determines the final elementary-school P.E. 
value of each example and hence its position on the scale by 
averaging its P.E. values in the different grades. In mak- 
ing this average, the values in certain grades are usually 
given more weight than the values in other grades, for the 
reason that P.E. values which fall near the middle of a grade 
distribution are more reliably determined than when they 
fall near the extremes. The method of weighting is de- 
scribed in Woody's The Measurement of Some Achievements 
in Arithmetic. 

12. When an elementary-school scale alone is desired 
much labor can be saved by computing from the outset the 
per cent of pupils in all the grades combined who solve each 
example correctly. In this way only one P.E. value of each 
example is read from Table 20. This value is referred to a 
zero determined from the per cent of all the pupils who 
make no score, rather than from the per cent of the second 
or third grade, and the scale is finished. 

This short method assumes that ability in Grade III
through VIII combined is distributed according to the nor-
mal frequency surface. This is not a specially violent 
assumption. The curve of ability for all these grades com- 
bined would, however, be slightly flat-topped, especially if 
Grades II and III were included. 

13. Still another short method would be to use the scale 
determined for the sixth grade as the elementary school 
scale. Standing as it does about mid-way between the fourth 
and eighth grade, results from it could be considered typical 
of the whole school. 
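
Steps 3 through 6 of the grade-scale method can be sketched as below for the per cents of Table 21. Note the substitution: instead of reading Table 20, the sketch computes the normal-curve value directly with Python's `statistics.NormalDist`, taking one P.E. as the distance from the median that bounds the middle 50 per cent of cases. Because Table 20 is read only to the nearest .05 P.E., the computed values differ from the tabled ones by a few hundredths. The function and variable names are assumptions of the sketch.

```python
from statistics import NormalDist

# One P.E. expressed in standard-deviation units: the distance from the
# median that cuts off the middle 50 per cent of a normal distribution.
PE_IN_SIGMA = NormalDist().inv_cdf(0.75)   # about 0.6745


def pe_from_percent(percent):
    """P.E. distance from the median for a given per cent of pupils succeeding.

    More than 50 per cent gives a negative value (below the median);
    less than 50 per cent gives a positive value (above the median).
    """
    return -NormalDist().inv_cdf(percent / 100) / PE_IN_SIGMA


# Step 3: per cent of third-grade pupils solving each example (Table 21).
percent_correct = {"A": 95, "B": 85, "C": 60, "D": 50, "E": 40, "F": 12.2, "G": 2.2}

# Step 4: P.E. distance of each example from the third-grade median.
from_median = {ex: pe_from_percent(p) for ex, p in percent_correct.items()}

# Step 5: 4 per cent of pupils solve no example, so 96 per cent stand
# above the zero point.
zero_point = pe_from_percent(100 - 4)

# Step 6: P.E. value of each example above the zero point.
above_zero = {ex: d - zero_point for ex, d in from_median.items()}

for ex in percent_correct:
    print(ex, round(from_median[ex], 2), round(above_zero[ex], 2))
```

The same conversion serves step 9: if, for instance, 25 per cent of one grade exceed the median of the adjoining grade, `pe_from_percent(25)` places that median about 1 P.E. above the first grade's median.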

Constancy of the P.E. in Grade Scales. — If we were
to measure carefully the length of a bar of iron in inches 
and were to perfectly preserve the bar and measure it again 
in 200 years, the bar would still be the same number of 
inches in length. The fifth example on Woody's Addition 
Scale, Series B, has a P.E. value of 3.26. Will this example, 
3 + 1 = , have a value of 3.26 P.E. 200 years from now?

There are at least two forces operating to cause a fluctua- 
tion from time to time in the value of an example, or any 
other test element for that matter. There are, first, changes 
in courses of study, efficiency of instruction and the like. 
Probably the chief difficulty in the above example was the 
plus sign. The pupils found it easier to add 72 and 26 
when the 26 was written under the 72 than to add 3 + 1. 
The very fact that this simple example proved so difficult 
will undoubtedly attract the notice of teachers and a special 
effort will be made to familiarize pupils with signs early in 
their school career. Courses of study may be altered to 
bring instruction in the four algebraic signs earlier than at 
present. The result would be a rearrangement of the prob- 
lems in the scale. Any change that was partial or otherwise 
to a particular test element would upset its present P.E. 
value. 

A second factor is an alteration in the classification of 
pupils. The use of tests is gradually improving the accuracy 
of classification, which in turn is reducing the amount of 
overlapping of ability between grades, which in turn is in-
creasing the interval between grades and influencing the
form of distribution, all of which is bound to influence the 
P.E. values of various test elements. Again, the grade, 
always a rather artificial feature, is becoming more or less 
meaningless. Some schools have reduced the eight grades 
to seven, others have eliminated grades by a departmen- 
talization, others have crossed grade lines with section 
classification on the basis of educational and mental tests. 
It is becoming increasingly probable that future test con- 
structors will prefer age to grade as a basis for scale con- 
struction. 

IV. Product Scale — Variability-of-Adult-
Performance Unit

Ayres' Handwriting Scale. — So far as I recall,
Ayres' ³ Handwriting Scale is the only scale which makes use
of this variability-of-adult-performance unit for scoring the
achievement of pupils. Most handwriting scales determine 
values on the scale by the judgment of adults rather than 
by the achievement or performance of adults. A quotation 
from Ayres will bring out very clearly the relationship be- 
tween his scale and Thorndike's scale and will show his 
reasons for adopting a different scoring unit. 

"The Thorndike Scale 

"This (Ayres' Scale) is not a pioneer piece of work in 
this field, although it is different in method from anything 
of the sort previously attempted. The credit of developing 
the first measuring scale for handwriting belongs to Pro- 
fessor Edward L. Thorndike of Teachers College, Columbia 
University. The publication, in March, 1910, of his hand- 
writing scale constituted a most important contribution not 
only to experimental pedagogy but to the entire movement 
for the scientific study of education. 

"As Professor Thorndike says in his introduction, previ- 
ous to that time educators were in the same condition with
respect to handwriting as were students of temperature be-
fore the discovery of the thermometer. Just as it was then
impossible to measure temperature beyond the very hot,
warm, cool, etc., of subjective opinion, so it had been
impossible to estimate the quality of handwriting except by
such vague standards as one's personal opinion that given
samples were very bad, bad, good, very good, etc. Pro-
fessor Thorndike's scale for the handwriting of children
is based on the average or median judgments of some 23 to
55 judges who graded samples of writing into groups by
what they considered equal progressive steps in general
merit.

³ Leonard P. Ayres, A Scale for Measuring the Quality of Handwriting of
School Children, No. J13; Russell Sage Foundation, N. Y. C.

"Legibility as a Criterion 

"The method by which the present scale has been pro- 
duced, and the criterion on which it rests as a basis, differ 
radically from those adopted by Professor Thorndike. The 
difference in the bases is that in the present case legibility 
has been adopted as a criterion for rating the different 
samples in place of 'general merit' used as the basis of 
Thorndike's scale. This change substitutes function for 
appearance as a criterion for judging handwriting. 

"There are two arguments for adopting the new criterion. 
In the first place the prime importance of writing is to be 
read, and hence it has seemed worth while to adopt 'reada- 
bility' as the basis for rating samples of handwriting. In 
the second place legibility possesses the advantage of being 
measurable in definite quantitative units through finding the 
amount of time required to read with a given degree of 
accuracy, a given amount of matter in the handwriting being 
studied. The criterion of general merit is not susceptible of 
any such evaluation. 

"The method whereby the new scale has been produced 
differs from the method employed in producing the previous 
scale in that it is based on the distribution of the recorded 
time required on the average by a number of readers to read 
the samples of writing, rather than on the average of their 
judgments concerning what they considered equal steps in 
general merit." 



Scaling the Test 265

Having, with proper experimental precautions, deter- 
mined the average time required by 10 individuals to read 
each of 1578 specimens of handwriting collected from the 
upper grades of the elementary school, he plotted all these 
rates into a frequency surface. The base line of the fre- 
quency surface was divided into ten equal intervals, the 
lowest rate being called 0, the highest rate being called 100,
and the mean rate on all specimens being labeled 50. The 
samples of handwriting having rates corresponding to the 
points 20, 30, . . . 100, were located and printed as a scale. 

The scale is used as follows: A pupil's handwriting is moved
along the scale until a writing of the same quality is found. 
The pupil's specimen is scored 20, 30, 40, or above accord- 
ing to the value of the scale specimen with which it corre- 
sponds in quality. 

The reader's attention is called to the following points 
and questions: First, legibility and not "general merit" is
the criterion for this scale. Is legibility the only important 
element in the world's practical evaluation of handwriting? 
Second, a pupil's handwriting is scored by a judgment com- 
parison with the scale. Does this final use of judgment 
make void the legibility criterion? 

IV. Product Scale — Variability-of-Judgment Unit 

Derivation of Product Scales. — Hillegas' English Com- 
position Scale, Starch's Handwriting Scale and Thorndike's 
Drawing Scale are typical instances of pure product 
scales. All were constructed in the same manner, which was 
as follows: 

1. The scale constructor selects many specimens of, say, 
composition which vary by small amounts from compositions 
of zero merit up to, say, the highest quality of composition 
produced by the best authors. 

2. He asks many presumably competent judges to 
arrange the compositions in order of merit and also to desig- 
nate the specimen which is, in their judgment, of just zero 
merit. 




3. He computes from these rankings the per cent of 
judges who rated specimen A better than specimen B, better 
than specimen C and so on. Then he computes the per cent 
of judges who rated specimen B better than specimen C, 
better than specimen D and so on. He continues this 
process until he has a table showing the per cent of judges 
who rate each specimen better than every other specimen. 
The per cent of judges rating a very poor specimen better 
than a very good one is likely to be zero, while the per cent 
rating the specimen of high merit better than the specimen 
of low merit is likely to be 100. Per cents of better judg- 
ments will range all the way from zero to 100. 

4. He subtracts 50 per cent from all the above per cents. 

5. He determines the P.E. difference in merit between 
each specimen and every other specimen by looking up these 
remainder per cents in Table 20. Table 22 illustrates the 
process. 

TABLE 22 

Shows How to Use Table 20 to Convert Per Cent of Better Judg- 
ments into P.E. Differences in Merit Between Composition 
Specimens A, B, C, etc. 

Specimens            A>T     N>A     B>N     K>B     E>K     L>E

Per cent             50      75      75      84.41   64.47   91.13

Per cent minus 50    0       25      25      34.41   14.47   41.13

P.E. Difference      0.0     1.0     1.0     1.5     0.55    2.00

6. He not only determines the P.E. difference AB, AC, 
AD, etc., and BC, BD, BE, etc., directly, but he determines 
these differences in many indirect ways as well. Thus, for 
example, the distance NA, above, equals TN minus TA,
the distance BN equals AB minus AN, the distance LE
equals [TL minus (TA + AN + NB + BK + KE)].
There are many other indirect ways of determining
the P.E. difference between any two specimens. 

7. The mean of all possible direct and indirect deter- 
minations of P.E. differences is computed to get the true 
difference. The greater the indirectness the less the
weight given to the determination in computing this mean
P.E. difference. 

8. He arranges the specimens in order of merit record- 
ing the P.E. distance each is above the preceding one, thus: 

Specimen         T     A     N     B     K     E      L

P.E. Distance          0     1.0   1.0   1.5   0.55   2.0

9. He records from the original data the number of 
judges indicating each specimen as of just zero merit. Some 
will indicate, say, K, some B, some N, some A, some T, and 
some specimens which are below A and T in merit. 

10. He computes the median zero specimen. Let us 
suppose that the median specimen is found to be A. 

11. He computes the P.E. distance each specimen is 
above the zero specimen and calls this its scale value. Since 
A and T are of equal merit the scale becomes as follows: 

Specimen       A or T   N     B     K     E      L

Scale Value    0        1.0   2.0   3.5   4.05   6.05

12. Beginning with the zero specimen he selects others 
above it such that distances between specimens will be about 
1 P.E. Smaller scale steps are probably not desirable for
scales which are to be widely used because a difference of
1 P.E. is a difference which only 75 out of 100 judges can
see. Smaller-step scales may be valuable for scientific work, 
or for use by individuals who are specially expert in detect- 
ing subtle differences in merit. When two or more specimens 
have approximately the same scale value they may all be 
presented in order to give a wider range of composition type, 
or that one may be selected which shows the least disagree- 
ment among judges. The Thorndike Extension of the 
Hillegas Scale adopted the former method and the Nassau 
Extension of the Hillegas Scale the latter. 
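
The arithmetic of steps 4, 5, 8 and 11 can be sketched in a few lines of Python. This is an illustration only, not part of the original procedure. It assumes that Table 20 is the ordinary normal-curve table, so that 75 per cent of better judgments corresponds to exactly 1 P.E. (about 0.6745 S.D.); the function names pe_difference and scale_values are hypothetical.

```python
from statistics import NormalDist

# Assumption: Table 20 is a normal-curve table, so 1 P.E. is the deviation
# that corresponds to 75 per cent of better judgments (about 0.6745 S.D.).
_PE_IN_SD = NormalDist().inv_cdf(0.75)


def pe_difference(percent_better):
    """P.E. difference in merit implied by a per cent of better judgments."""
    if not 0.0 < percent_better < 100.0:
        raise ValueError("0 and 100 per cent give no finite P.E. difference")
    return NormalDist().inv_cdf(percent_better / 100.0) / _PE_IN_SD


def scale_values(distances_above_preceding):
    """Accumulate step distances into scale values above the zero specimen."""
    values, total = [], 0.0
    for distance in distances_above_preceding:
        total += distance
        values.append(round(total, 2))
    return values


# Per cents of better judgments from Table 22 (A>T, N>A, B>N, K>B, E>K, L>E).
percents = [50, 75, 75, 84.41, 64.47, 91.13]
steps = [round(pe_difference(p), 2) for p in percents]
print(steps)                # P.E. distance of each specimen above the preceding one
print(scale_values(steps))  # scale values measured from the zero specimen A or T
```

The sketch uses only the direct determinations; the weighted mean of direct and indirect determinations described in steps 6 and 7 is not attempted here.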

Validity and Constancy of Judgment Units. — What is 
the validity of this P.E. as a unit of measurement? Product 
scales were made possible by the formulation of the now 
famous Cattell-Fullerton theorem, and by the ingenious
application by Thorndike of this theorem in the construction
of educational scales. Courtis has reported an experiment
which was conducted to test the validity of this basic 
theorem; namely, differences which are equally often noticed 
are equal unless they are always noticed or never noticed. 
Courtis wanted to know whether differences which are 
equally often noticed really are equal. To test this he 
made a product scale of areas instead of compositions or 
specimens of handwriting. After determining the differ- 
ences between areas of variously shaped figures by means 
of judgments, he determined the differences by actual 
measurement. The differences as determined by judgments 
followed the principle of Weber's law, i. e., when the area 
was small, a slight increase or decrease in area could be 
seen; when the area was large, a considerable change of 
area was necessary in order that judges might be able to 
notice the difference. In other words, equally often noticed 
differences were equal for areas of about the same size only. 
The theorem does not hold for widely separated areas. 
Does it hold for specimens of penmanship widely separated 
in merit? Presumably it does not, if there are absolute 
differences in merit of handwriting in the same sense that 
there are absolute differences in area. 

Even if this last is true, we need not lose confidence in 
our product scales. Education is interested in many kinds 
of differences. It would be valuable to know them all. 
There are absolute differences such as Courtis points out. 
There are difficulty differences, and this is the kind of differ- 
ence Woody's arithmetic scales bring out. Product scales 
measure judgment differences. The values on percentile, age,
and grade scales are determined by how difficult pupils
actually find the test elements. These scales could be con- 
verted into product scales by determining the difficulty of 
each test element, not by the achievement of the pupils, but 
by the opinion of adults. This has not often been done 
simply because education is far more concerned with how 
difficult test elements actually are than how difficult somebody
thinks they are. But in the realm of composition,
handwriting and the like, we are not primarily concerned
with difficulty but with merit, and we are less concerned 
with an absolute merit than we are with the merit as deter- 
mined by the opinion of competent judges, in the way that 
competent judges practically operate outside or inside the 
schools. 

Is the judgment scoring unit constant? The meter was 
originally defined as one ten-millionth of the distance from 
the pole to the equator. Alteration of this distance through 
the centuries due to the contraction or expansion of the 
earth would, of course, alter the meter, especially if a re- 
determination became necessary because of the loss of the 
meter bar carefully preserved at Paris. Alteration of this 
distance due to the subjectivity of the determiner would 
also alter the meter. As a matter of fact no two determina- 
tions of the pole to equator distance have turned out to be 
exactly the same. Consequently the meter is now measured 
in terms of so many wave lengths of a certain radiation. 
What forces are operating to produce an inconstancy of 
P.E. in, let us say, a composition scale? 

Only the two most likely forces need be mentioned. First, 
it is possible to discriminate finer shades of composition 
merit. There is certainly room for improvement in this 
respect. So far as most of us are concerned there is "low 
visibility" when it comes to evaluating composition merit. 
The effect of a more microscopic eye would be to make P.E. 
smaller than it is at present. Second, it is possible that 
future judges will have a different opinion from present- 
day judges as to what constitutes merit in a composition. 
It is conceivable, but scarcely probable, that a literary dic- 
tator will arise whose popularity will be so great as to 
completely change the current of the world's literary 
appreciation. The nibbling of literary radicals is undoubt- 
edly producing small but continuous changes in the weight 
we attach to each of the numerous factors entering into a
composition. As I sit in Morningside Park, numerous small
caterpillars are spinning a basket web in a nearby bush. A 
large red spider has gratefully attached one edge of his web 
to the caterpillar's basket. Many of the caterpillars are 
so interested in the mechanics of their work that they often 
imprison themselves in their own silken net. The spider 
doubles their imprisonment by rolling them into little round 
balls. The radicals claim that our meticulous teaching 
methods have in similar fashion imprisoned the "rare spirit's 
wings" in the mesh of mechanics. We are only just begin- 
ning to investigate the extent to which literary evaluations 
vary with the age and sex of the judge, the type of literary 
instruction he has had, the purpose for which the composi- 
tions were written and the like. 

Peculiarity of Product Scales. — Composition, hand- 
writing and drawing scales are peculiar in that they are not 
tests at all. They are scoring instruments. For this reason 
as well as for the manner of their construction they are 
called product scales to contrast them with percentile, age, 
and grade scales which together are usually called perform- 
ance scales. The performance scales are placed in the hands 
of the pupils in order to test them. They are real testing 
instruments and not scoring instruments. Thorndike's 
Visual Vocabulary Scale A-2 is the test instrument. The 
scoring instrument is a correctly filled out test. Woody's 
Addition Scale is a test instrument. The scoring instrument 
is a set of correct answers. Collection of the pupils' com- 
position specimens is the composition test. The composition 
scale is only the scoring instrument. In the case of per- 
formance scales the dramatic instrument is not the scoring 
instrument but the testing instrument. To date, the scoring 
instrument has been assumed to be satisfactory. In the 
case of product scales the situation is exactly reversed. As 
will be suggested later, both scoring instrument and what is 
scored may be scaled. The following table will make clearer 
the relation between what is scored, the scoring scale, the 
scoring instrument, and the scale unit. 



Thing Scored                      Scoring Scale   Scoring Instrument       Scale Unit

Man's height                      Distance        Yard stick               Yd. ft. in.
Until train leaves                Time            Watch                    Hr. min. sec.
Heat of water                     Temperature     Thermometer              Degree
Courtis Arith., Series B
  (a) Speed                       Speed           None                     The example
  (b) Accuracy                    Accuracy        Correct answers          The example
Woody Arith., Series B            Difficulty      Correct answers          P.E.
Thorndike-McCall Reading Scale    Difficulty      Correct answers          T
Starch Handwriting                Quality         Handwriting specimens    P.E.
Composition                       Quality         Composition specimens    P.E.



CHAPTER X 

SCALING THE TEST. T SCALE — AGE VARIABILITY UNIT

I. The Method of Scale Construction 

Preparation of Test Material. — Recently I undertook, 
at the suggestion of Thorndike, the task of constructing a
much-needed series of reading scales. A careful study of 
previous methods of scale construction led to the conviction 
that there was great need for revision. Perhaps the best 
way to show the method evolved and why it was evolved 
to correct some of the defects of existing methods is to 
describe in detail just how one of these reading scales was 
constructed. The steps in the process, with some altera- 
tions, were as follows: 

1. Selections of prose and poetry were made which were 
brief, which gradually varied in difficulty from very easy 
to very difficult, which were fairly representative of read- 
ing material both in school and out, which were reasonably 
free from technical terms, and which were equally fair to 
rural and urban children. 

2. Questions were formulated which could be answered 
from or inferred from or were related to the reading selec- 
tions, which would yield brief, scorable answers, which 
were unambiguous, whose difficulty approximated that of
the selection, and which were independent either in wording 
or difficulty of any preceding or succeeding questions, and 
which were numerous enough to make up one test of the 
desired length. 

3. Several experienced teachers answered all the questions
and assisted in arranging the questions in the order
of their difficulty for school children, i. e., the questions
judged to be easiest were placed at the beginning of the test 
and the most difficult ones at the end. 

4. The test when arranged was mimeographed along 
with the instructions to pupils. So far as possible the type 
and spacing and other features were identical with that to 
be used in the final printed scale. 

Scaling the Individual Questions. — 

5. The test was applied to a few hundred pupils in 
grades III through VIII. The total population in each 
school was tested so as to yield a rough approximation to 
a normal frequency distribution. 

6. The pupils' answers to the questions were scored as 
either right or wrong. 

7. The few questions which in the try-out proved ambiguous
or unscorable or were otherwise unsatisfactory were
eliminated. 

8. The results for the remaining questions were tabu- 
lated by each question for each pupil. The following shows 
how the tabulation was made. 



Pupil      Selection I     Selection II     Selection III, etc.
           1 2 3 4         1 2 3 4 5        1 2 3
S. A       1 1 1 1         1 1              0 0 0
R. N       1 1 1           1 1 1 1 1        0 1 0



9. The total number of pupils answering each question 
correctly was computed and divided by the number of pupils 
tested to get the per cent of correct answers. 

10. This per cent was found in Table 23 and the corre- 
sponding difficulty or S.D. value was read. The difficulty 
value of each question so found was only roughly correct. 
Table 23 assumes a normal distribution of ability among the 
pupils. This assumption is sufficiently met by using only 
pupils in one grade or only pupils of one age. Combining 
grades makes the curve of ability somewhat too flat on the 
top. But the values from combined grades were accurate 
enough for the purpose. Computing from all grades com- 
bined saves much time. 




How Table 23 was used to convert per cents correct into 
S.D. values, i. e., difficulty values, is shown below. 

Selection           I                       VIII
Question            1       2      3        1      2       3       4
Per cent correct    99.99   99.7   98.13    0.61   0.003   0.008   0.53
S.D. Value          13      22     29       75     90      88      76
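
Steps 9 and 10 can also be carried out by computation instead of table lookup. The sketch below is an illustration under the same assumption as Table 23, namely a normal distribution of ability with the zero point 5 S.D. below the mean. The function names percent_correct and sd_value, and the four-pupil record, are hypothetical.

```python
from statistics import NormalDist


def percent_correct(responses):
    """Per cent of pupils answering each question correctly.

    `responses` holds one row per pupil; each entry is 1 (right) or 0 (wrong).
    """
    n_pupils = len(responses)
    n_questions = len(responses[0])
    return [100.0 * sum(row[q] for row in responses) / n_pupils
            for q in range(n_questions)]


def sd_value(percent):
    """Difficulty in tenths of an S.D. above a zero point 5 S.D. below the mean."""
    if not 0.0 < percent < 100.0:
        raise ValueError("0 and 100 per cent have no finite S.D. value")
    # A question passed by few pupils lies far above the mean, hence 1 - p.
    z = NormalDist().inv_cdf(1.0 - percent / 100.0)
    return 10.0 * (5.0 + z)


# Hypothetical try-out records: four pupils, three questions.
records = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]]
for number, pct in enumerate(percent_correct(records), start=1):
    print(number, pct, round(sd_value(pct)))
```

A question which every pupil or no pupil answers correctly receives no finite value, which is one reason a try-out group of a few hundred pupils is wanted.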

TABLE 23

Shows the S.D. distance of a given per cent above zero. Each
S.D. value is multiplied by 10 to eliminate decimals. The zero
point is 5 S.D. below the mean

S.D.                S.D.              S.D.              S.D.
Value  Per cent     Value  Per cent   Value  Per cent   Value  Per cent

0      99.999971    25     99.38      50     50.00      75     0.62
0.5    99.999963    25.5   99.29      50.5   48.01      75.5   0.54
1      99.999952    26     99.18      51     46.02      76     0.47
1.5    99.999938    26.5   99.06      51.5   44.04      76.5   0.40
2      99.99992     27     98.93      52     42.07      77     0.35
2.5    99.99990     27.5   98.78      52.5   40.13      77.5   0.30
3      99.99987     28     98.61      53     38.21      78     0.26
3.5    99.99983     28.5   98.42      53.5   36.32      78.5   0.22
4      99.99979     29     98.21      54     34.46      79     0.19
4.5    99.99973     29.5   97.98      54.5   32.64      79.5   0.16
5      99.99966     30     97.72      55     30.85      80     0.13
5.5    99.99957     30.5   97.44      55.5   29.12      80.5   0.11
6      99.99946     31     97.13      56     27.43      81     0.097
6.5    99.99932     31.5   96.78      56.5   25.78      81.5   0.082
7      99.99915     32     96.41      57     24.20      82     0.069
7.5    99.9989      32.5   95.99      57.5   22.66      82.5   0.058
8      99.9987      33     95.54      58     21.19      83     0.048
8.5    99.9983      33.5   95.05      58.5   19.77      83.5   0.040
9      99.9979      34     94.52      59     18.41      84     0.034
9.5    99.9974      34.5   93.94      59.5   17.11      84.5   0.028
10     99.9968      35     93.32      60     15.87      85     0.023
10.5   99.9961      35.5   92.65      60.5   14.69      85.5   0.019
11     99.9952      36     91.92      61     13.57      86     0.016
11.5   99.9941      36.5   91.15      61.5   12.51      86.5   0.013
12     99.9928      37     90.32      62     11.51      87     0.011
12.5   99.9912      37.5   89.44      62.5   10.56      87.5   0.009
13     99.989       38     88.49      63     9.68       88     0.007
13.5   99.987       38.5   87.49      63.5   8.85       88.5   0.0059
14     99.984       39     86.43      64     8.08       89     0.0048
14.5   99.981       39.5   85.31      64.5   7.35       89.5   0.0039
15     99.977       40     84.13      65     6.68       90     0.0032
15.5   99.972       40.5   82.89      65.5   6.06       90.5   0.0026
16     99.966       41     81.59      66     5.48       91     0.0021
16.5   99.960       41.5   80.23      66.5   4.95       91.5   0.0017
17     99.952       42     78.81      67     4.46       92     0.0013
17.5   99.942       42.5   77.34      67.5   4.01       92.5   0.0011
18     99.931       43     75.80      68     3.59       93     0.0009
18.5   99.918       43.5   74.22      68.5   3.22       93.5   0.0007
19     99.903       44     72.57      69     2.87       94     0.0005
19.5   99.886       44.5   70.88      69.5   2.56       94.5   0.00043
20     99.865       45     69.15      70     2.28       95     0.00034
20.5   99.84        45.5   67.36      70.5   2.02       95.5   0.00027
21     99.81        46     65.54      71     1.79       96     0.00021
21.5   99.78        46.5   63.68      71.5   1.58       96.5   0.00017
22     99.74        47     61.79      72     1.39       97     0.00013
22.5   99.70        47.5   59.87      72.5   1.22       97.5   0.00010
23     99.65        48     57.93      73     1.07       98     0.00008
23.5   99.60        48.5   55.96      73.5   0.94       98.5   0.000062
24     99.53        49     53.98      74     0.82       99     0.000048
24.5   99.46        49.5   51.99      74.5   0.71       99.5   0.000037
                                                        100    0.000029
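
Any entry of Table 23 can be checked, or an intermediate entry supplied, by computing the area of the normal curve directly. The following Python sketch is offered only as a check on the table; it assumes the table was built from the normal curve as its heading states, and the function name percent_above is hypothetical.

```python
from statistics import NormalDist


def percent_above(sd_value):
    """Per cent of a normal distribution lying above a given S.D. value.

    S.D. values are in tenths of an S.D. measured from a zero point
    5 S.D. below the mean, so the mean itself sits at 50.
    """
    z = (sd_value - 50.0) / 10.0
    return 100.0 * (1.0 - NormalDist().cdf(z))


# Print a few rows for comparison with the printed table.
for value in (25, 40, 50, 60, 75):
    print(value, round(percent_above(value), 2))
```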



Selection and Arrangement of Final Test. — 
11. The questions were rearranged in order of actual
difficulty. Any question which had been greatly misplaced 
originally was eliminated entirely. Scales as they have usu- 
ally been scaled need to be scaled over again just as soon 
as they are finished. Because the original test has usually 
been much larger than the final test and because some of the 
test elements have been shifted, in the process of scale mak- 
ing, far from their original position, it has been found that 
the difficulty of test elements is different in the final 
scale from what their difficulty was in the original 
setting. 

The difficulty of any question is partly a function of its 
real difficulty, partly a function of some preceding question 
which may give a right or wrong mental set, and partly a 
function of its distance from the first question. Place a 
question nearer the beginning of a test and it tends to be- 
come easier. Remove it from its surrounding questions and 
unless it is highly independent of them its difficulty tends to 
vary. In carefully constructing a scale it is advisable for
these reasons, to eliminate a question whose position must
be greatly altered. For the same reasons it is well to go 
through this preliminary scaling in order that the final 
scaling will have permanence. 

All transient questions might have been eliminated. Thus 
far the difficulty of each question has been determined for 
the total group only. This is usually all that is necessary. 
But it may be the case that the order of difficulty for certain 
questions would be markedly different for different grades. 
Questions whose difficulty varies from group to group in 
this fashion are called transients. Undoubtedly a scale 
would be improved by the elimination of the worst of these 
transients. This may be done by repeating steps 9 and 10 
for each grade separately. It is doubtful whether the scale 
freed from transients is enough superior to justify the large 
amount of extra labor required. 

12. Any serious gap of difficulty in the scale which could 
not be filled by shifting a question from one position to an- 
other was filled by combining two or more questions and 
treating them as a single question in the final scale. Just 
as it was possible to compute, in step 9, the per cent of pupils 
who answered correctly each question and to convert this per 
cent, in step 10, into a scale value, so it was possible to com- 
pute for this same group the per cent of pupils who answered 
any two or any three questions correctly and to convert this 
into a comparable scale value. 

It should be noted that the per cent of pupils answering 
two questions correctly could not be found by averaging 
the per cent of pupils doing one of the questions correctly 
with the per cent answering the other one correctly. It 
should also be noted that two questions so combined were 
ever after treated as a single question. Both answers had 
to be correct before a pupil could receive credit for the 
question. 

13. The material thus selected and arranged was printed 
along with instructions, and other advice, in its final booklet 
form. 
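
The caution in step 12, that the per cent answering two questions correctly cannot be found by averaging the two separate per cents, is easily verified from the right-or-wrong records. The Python sketch below uses a hypothetical record of four pupils, and the function name percent_all_correct is likewise hypothetical.

```python
def percent_all_correct(responses, questions):
    """Per cent of pupils who answered every listed question correctly.

    `responses` holds one row per pupil (1 = right, 0 = wrong);
    `questions` lists the column positions to be combined.
    """
    passed = sum(1 for row in responses
                 if all(row[q] == 1 for q in questions))
    return 100.0 * passed / len(responses)


# Hypothetical records: four pupils, two questions (columns 0 and 1).
records = [[1, 0], [0, 1], [1, 1], [1, 0]]
separate = [percent_all_correct(records, [q]) for q in (0, 1)]
combined = percent_all_correct(records, [0, 1])
print("separate per cents:", separate)
print("average of the two:", sum(separate) / len(separate))
print("both correct:      ", combined)
```

The combined per cent can never exceed the smaller of the separate per cents, so the combined question is at least as difficult as the harder of its parts, and it is this property which allows it to fill a gap in the scale.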




Application and Scoring of Final Test. — 

14. An effort was made to select for final testing purposes 
a group of schools which when combined would have at 
least 500 pupils between the ages of 12.0 and 13.0 and which 
when combined would give fairly representative pupils of 
all ages, i. e., the per cent of pupils of each level of ability 
found in any total unselected group. It was, of course, im- 
possible to find in schools representative pupils who were 
very young or very old. 

15. The test was applied to all pupils in grades III 
through VIII inclusive and to all pupils between the ages 
of 12.0 and 13.0 whether they were in ungraded classes or 
high school. When the test was given in cities so large 
that the total school population was not measured caution 
had to be exercised to test just the right number of 12-13 
year old high-school pupils. 

16. Each question was carefully scored as either right or 
wrong according to a set of guiding principles formulated 
for the purpose of making all scoring as uniform as possible. 
There is considerable evidence to indicate that the method 
of scoring with partial credits for questions in various stages 
of correctness is not enough superior to simply calling the 
pupil's answer right or wrong to justify the extra trouble. The
method of right-or-wrong scoring is not, however, obligatory.

17. All correct answers to each question, the worst 
answers which were accepted and the best answers which 
were rejected as well as other sample answers were carefully 
tabulated as found to make a scoring key for the guidance 
of all future scoring both prior to the completion of the scale 
and after. 

18. The test booklet for each pupil was thrown into 
the pile for his age and grade. The following illustrates 
how this classification was made. The Roman numerals rep- 
resent grades and the Arabic numerals represent chrono- 
logical ages. There was a stack of papers for each Roman 
numeral, except in cases where no pupils of a given age 
group were found in the grade indicated. 



6-7     7-8     8-9     9-10    10-11   11-12

III     III     III     III     III     III
IV      IV      IV      IV      IV      IV
V       V       V       V       V       V
VI      VI      VI      VI      VI      VI
VII     VII     VII     VII     VII     VII
VIII    VIII    VIII    VIII    VIII    VIII

12-13   13-14   14-15   15-16   16-17   17-18

III     III     III     III     III     III
IV      IV      IV      IV      IV      IV
V       V       V       V       V       V
VI      VI      VI      VI      VI      VI
VII     VII     VII     VII     VII     VII
VIII    VIII    VIII    VIII    VIII    VIII



Scaling the Total Number of Questions Right. — 

19. The total number of questions (single, double, or 
triple), which were answered correctly by each twelve-year- 
old pupil, was determined. This is shown in the first and 
second columns of Table 24. 

20. The per cent of all twelve-year-old pupils who ex- 
ceeded no questions plus half those who did no questions 
was computed. Thus of the 500 twelve-year-olds shown in 
Table 24, 497 made scores larger than no questions correct. 
Half of those answering no questions correctly plus 497 
makes 498.5 which is 99.7 per cent of 500. In similar 
fashion there was computed the per cent of those exceeding 
one question plus half those doing one question, and then 
the per cent of those exceeding two questions plus half those 
answering two questions, and so on. This yields the per 
cents shown in the third column of Table 24. 

21. These per cents were converted into S.D. values or 
scale scores by the use of Table 23. Thus a per cent of 
99.7 is equivalent to a scale score of 23, and a per cent of 
99.3 is equivalent to a scale score of 25, and so on for the 
other per cents. This means that pupils who are unable to 
answer a single question on the test should be assigned a 
scale score of 23, those who answer one question correctly 
should be assigned a scale score of 25 and so on.






TABLE 24

Shows How to Scale Total Scores

Total No.    No. of             Per Cent Exceeding
Questions    Twelve-Year-Old    Plus Half Those       Scale
Correct      Pupils             Reaching              Score

0            3                  99.7                  23
1            1                  99.3                  25
2            2                  99.0                  27
3            1                  98.7                  28
4            2                  98.4                  29
5            2                  98.0                  29
6            2                  97.6                  30
7            2                  97.2                  31
8            4                  96.6                  32
9            2                  96.0                  32
10           2                  95.6                  33
11           10                 94.4                  34
12           3                  93.1                  35
13           8                  92.0                  36
14           8                  90.4                  37
15           13                 88.3                  38
16           15                 85.5                  39
17           18                 82.2                  41
18           28                 77.6                  42
19           26                 72.2                  44
20           34                 66.2                  46
21           40                 58.8                  48
22           40                 50.8                  50
23           41                 42.7                  52
24           37                 34.9                  54
25           31                 28.1                  56
26           35                 21.5                  58
27           24                 15.6                  60
28           26                 10.6                  62
29           21                 5.9                   66
30           14                 2.4                   70
31           3                  0.7                   75
32           1                  0.3                   78
33           1                  0.1                   81
34           0                                        85
35           0                                        90
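
Steps 19 to 21 can be sketched in Python as follows. This is an illustration, not the procedure as originally carried out: it computes the normal deviate directly instead of reading Table 23, so a result may differ by a point from a value obtained by table lookup and rounding. The function names are hypothetical; the counts are those of the second column of Table 24 for scores 0 through 33.

```python
from statistics import NormalDist


def percent_exceeding_plus_half(counts):
    """For each total score, the per cent of pupils exceeding it
    plus half the per cent of pupils just reaching it.

    `counts[k]` is the number of pupils with exactly k questions correct.
    """
    total = sum(counts)
    percents, at_or_below = [], 0
    for n in counts:
        at_or_below += n
        exceeding = total - at_or_below
        percents.append(100.0 * (exceeding + n / 2) / total)
    return percents


def scale_score(percent):
    """Scale score with the twelve-year-old mean at 50 and 1 S.D. equal to 10."""
    if not 0.0 < percent < 100.0:
        # Scores which no pupil reached cannot be placed by this formula.
        raise ValueError("per cent must lie strictly between 0 and 100")
    return 50.0 + 10.0 * NormalDist().inv_cdf(1.0 - percent / 100.0)


# Number of twelve-year-old pupils at each total score, 0 through 33 (Table 24).
counts = [3, 1, 2, 1, 2, 2, 2, 2, 4, 2, 2, 10, 3, 8, 8, 13, 15, 18, 28, 26,
          34, 40, 40, 41, 37, 31, 35, 24, 26, 21, 14, 3, 1, 1]
percents = percent_exceeding_plus_half(counts)
for score in (0, 1, 22, 33):
    print(score, round(percents[score], 1), round(scale_score(percents[score])))
```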



[TABLE 25. Number of pupils of each age answering correctly each total number of questions, with the total number of pupils, the total scale score, and the mean scale score for each age at the bottom. The body of the table is not legible in this copy.]




Determination of Age Norms. — 

22. A table was constructed showing the number of 
pupils of each age answering correctly a defined number of 
questions. Table 25 shows the results. 

23. The total number of pupils for each age, the total 
scale score for each age, and the mean scale score for each 
age were computed. These are shown at the bottom of
Table 25. The second vertical column rather than the first 
vertical column of Table 25 was, of course, used in making 
the latter determinations. 

24. An effort was made to determine a truer mean scale 
score for each age than was yielded by step 23 above. 
Since few tests were given to any grade below the third it is 
evident that, on the whole, only the brightest seven-year-olds 
and eight-year-olds were tested. The less gifted were still 
in the kindergarten, first grade, or second grade. This fac- 
tor of selection has operated to make the mean for these two 
ages in particular unduly high. Since no tests were given 
in high school except to twelve-year-olds and since the 
brightest pupils above the age of about thirteen have gradu- 
ated from the elementary school or have left to go to work, 
thereby leaving the stupider children in the elementary 
grades, it is evident that the mean scale for each of these 
upper ages is too low or is unreliable because of the small 
number of cases. 

That the factor of selection has been operating can be 
shown not only by logic but also by age-grade statistics. 
Since age-grade tables have been widely circulated by Ayres, 
Strayer, Thorndike, and others, repetition here is not neces- 
sary. 

Not only selection but also the amount of the selection is 
indicated by the total number of pupils of each age shown at 
the bottom of Table 25. Special pains were taken to secure 
all the twelve-year-olds. Practically all of them had entered 
school. None had left school except to enter high school 
where they were found and tested. Five hundred twelve- 
year-olds were found. As shown in Table 25 the lower the
ages go below twelve the smaller the number of pupils be- 
comes even when it is certain that there are at least as many 
children for each age below twelve as there are for age 
twelve. A similar decrease in numbers occurs above twelve. 

The method of determining the truer means for ages 
below and above the age of twelve was to compute a median 
score on the rough assumption that there existed in the 
schools or communities 500 children of each age whether 
tested by me or not, and that the untested younger children 
were below the median of their own age group while the un- 
tested older children, particularly the compulsory-school- 
age children of 13 and 14, were above the median of their 
own age group. Since 500 minus 35 equals 465 which is far 
in excess of the median 250th seven-year-old child the truer 
mean for this age is evidently below the lowest extremity of 
the scale. Since 500 minus 173 equals more than 250 it is 
equally impossible to determine by this method the truer 
mean for eight-year-olds. It is possible, however, to find a 
truer mean for nine-year-olds, since 500 minus 347 equals 
153 which is less than 250. To find the median score for this 
age it is necessary to count down 250 minus 153, i.e., 97. 
This gives a median score or truer mean of 35. In similar 
fashion a truer mean was computed for ages 10 and 11. 
The computation of a truer mean for ages 13 and 14 was 
simpler for here the non-tested pupils were not missing from 
the bottom but from the top of the group. Hence all that 
was needed was to count down 250 pupils. Since ages 
15, 16, and 17 did not have 250 pupils a truer mean for them 
could not be determined. 
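
The counting rule just described can be sketched in Python. This is only an illustration under the text's own rough assumption of 500 children per age; the function name, its arguments, and the frequency counts are hypothetical, and the score of the median child is taken directly rather than interpolated.

```python
def truer_median(freq, assumed_total=500, missing="below"):
    """Score of the median child when only part of an age group was tested.

    freq: {scale_score: number_of_tested_pupils}
    assumed_total: assumed size of the whole age group (500 in the text)
    missing: "below" if the untested children are assumed to rank under the
             tested ones (younger ages), "above" if over them (older ages)
    Returns None when the median child falls among the untested.
    """
    tested = sum(freq.values())
    untested = max(assumed_total - tested, 0)
    median_rank = assumed_total // 2  # the 250th child of 500
    # Rank of the median child counted from the lowest tested pupil.
    rank = median_rank - untested if missing == "below" else median_rank
    if rank < 1 or rank > tested:
        return None
    running = 0
    for score in sorted(freq):
        running += freq[score]
        if running >= rank:
            return score


# Hypothetical distributions; only the totals 35 and 347 come from the text.
seven = {30: 35}
nine = {30: 50, 35: 100, 40: 197}
thirteen = {50: 200, 55: 100, 60: 50}

print(truer_median(seven))                      # median child was not tested
print(truer_median(nine))                       # 97th tested pupil from the bottom
print(truer_median(thirteen, missing="above"))  # 250th tested pupil from the bottom
```

With 347 nine-year-olds tested, 153 are assumed missing from the bottom, so the median child is the 97th tested pupil counted from the lowest score, as in the text.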

As stated before, it is certain that the means for the lower 
ages are too high and for the upper ages too low. It is 
equally certain that the truer means for the lower ages are 
too low and for the upper ages too high. The method of de- 
termining the truer mean assumes that the untested younger 
pupils are all below the median of their own age group and 
that the opposite situation obtains for the older children. 
For this assumption to be absolutely correct would require,
among other conditions, that the system of promotion in the 
public schools be very accurate. Since we know that it is 
very fallible we can be morally certain that some of the 
young untested children would have made records above 
the median of their age group had all been tested. Since 
the number of these would undoubtedly have been compara-
tively few there are strong grounds for believing that the
truer mean scale score is a closer approximation to the gen- 
uinely true mean for which we are searching than the mean 
scale score. 

There are highly technical statistical procedures for infer- 
ring the location of the true mean, but inspection, guided 
by the mean and the truer mean, was accurate enough for the 
present purpose. The row of crosses in Fig. 3 pictures the 
means, and the row of circles the truer means of Table 25. 
The straight line, following the row of circles more closely 
than the row of crosses, gives my estimate of the position of 
the true mean and consequently of the true age norms. 

The line for the true means shown in Fig. 3 is a straight 
line and has been extended above and below ages for which 
actual data is available. The most probable true means 
coincide so closely to a straight line that a straight line has 
been assumed in order to simplify interpolation in the com- 
putation of month standards from year standards. There 
is strong probability that for very young ages the line should 
bend downward. There should probably be a similar bend 
for ages beyond about 15. Proof of this last is the fact that 
the mean scale score for very superior teachers is just about 
equivalent to the standard score for age 18 if the true-mean 
line is continued upward as a straight line. Since, however, 
the line of the true means is a straight line for most of the 
ages it is continued as a straight line above and below until 
data has been secured upon which to estimate the course of 
the line at the two extremes. 

The straight line continuation has the added practical
advantage of making it possible to assign to any pupil a
reading age though he makes a scale score of 90, even when
it is certain that no age, however old, has a true mean as
high as some younger children can attain. This practical
advantage does not, however, outweigh the necessity of de-
termining empirically and reporting age standards for very
young and very old children.

[Figure: scale score (vertical axis) plotted against age in years, 6 to 22 (horizontal axis). Crosses mark the mean scale scores, circles the truer mean scale scores, and the straight line the true mean.]

Fig. 3. Shows the Obtained Mean Scale Scores for Each Age, the Truer Mean
or Estimated Median Score, and the True Mean Based upon the Obtained
Mean and Truer Mean. (See Table 25.)

Norms for each month can be read from the straight line 
in Fig. 3. The mean and truer mean scores shown in Table 
25 are, of course, for ages 7½, 8½, and so on.
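
Reading a month norm from a straight line is ordinary linear interpolation. The sketch below assumes a line fixed by two points; the two points chosen, and all function names, are illustrative only, since the actual true-mean line of Fig. 3 was drawn by inspection and its exact position is not stated here.

```python
def line_through(age_a, score_a, age_b, score_b):
    """Slope and intercept of the straight line through two (age, score) points."""
    slope = (score_b - score_a) / (age_b - age_a)
    intercept = score_a - slope * age_a
    return slope, intercept


def norm_for(years, months, slope, intercept):
    """Scale-score norm for an age given in years and months."""
    return intercept + slope * (years + months / 12)


def reading_age(scale_score, slope, intercept):
    """Age, in years, at which the line reaches the given scale score."""
    return (scale_score - intercept) / slope


# Hypothetical anchor points: scale score 35 at age 9.5 and 50 at age 12.5.
slope, intercept = line_through(9.5, 35, 12.5, 50)
print(norm_for(10, 0, slope, intercept))
print(reading_age(50, slope, intercept))
```

Because the line is extended beyond the ages for which data exist, `reading_age` will return an age for any scale score, which is the practical advantage, and the limitation, noted above.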

Determination of Grade Norms. — 

25. A table similar to Table 25 was constructed showing 
the number of pupils in III A, III B, IV A and so on making 
defined scores. A mean scale score was computed for each 
section of each grade. This gave norms for grade and sec- 
tion. 

Special Extension of Scale. — Step 25 completes the 
scale and the establishment of age and grade norms unless 
it is desired to test pupils whose ability extends beyond the 
range of the ability of twelve-year-olds. Except in very 
rare traits twelve-year-old ability will go down as low as it is
ever necessary to measure. In such rare traits the scale may 
be specially extended downward. Very few pupils will ever 
be found in the elementary school who possess an ability in 
excess of that of the ablest twelve-year-old. A few high 
school students cannot be given their exact scale score unless 
the scale is specially extended upward. Any student 
answering more than 33 questions correctly could not, 
according to Table 24, be assigned any other scale score than 
81 plus. If a scale is intended for frequent use with ad- 
vanced high-school students in addition to elementary school 
pupils it will be desirable to extend the scaling above 81. 

26. Step 19 was repeated for all sixteen-year-old high- 
school students in a certain high-school. Students who were 
fourteen years old or fifteen years old might have been used 
instead. 

27. It was determined that in this particular community 
20 per cent of all sixteen-year-olds were in high school. 
This meant that the 200 sixteen-year-olds tested were, on the
whole, the brighter portion of the 1000 sixteen-year-olds in
the community. 

28. Beginning at the bottom of the frequency distribu- 
tion mentioned in step 26, the number of sixteen-year-olds 
exceeding-plus-half-those-reaching 35 questions correct was
determined. A similar determination was made for 34 ques- 
tions correct, and then for 33 questions correct. 

29. To get the per cent of sixteen-year-olds exceeding- 
plus-half-those-reaching 35 questions correct and then 34 
questions correct, and then 33 questions correct, the num- 
bers found in step 28 were divided, not by 200, but by 1000. 
This is an approximate correction for the failure to test all 
sixteen-year-olds in the community. 

30. These per cents were converted into S.D. values by 
means of Table 23. 

31. The scale score for 34 questions correct was 4 points 
above the scale score for 33 questions correct, so 4 points 
were added to the 81 shown in the last column of Table 24.
The points difference between 35 questions and 33 ques- 
tions was 9, so 9 was added to the 81 of Table 24. This 
gave a scale score of 85 for 34 questions correct, and a scale 
score of 90 for 35 questions correct, thus extending the scale 
upward. 
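
Steps 26 to 31 can be sketched as follows. The sketch assumes that Table 23 behaves like the normal curve with the mean at 50 and 10 points to the S.D., and substitutes Python's `statistics.NormalDist` for the printed table; the function names and the frequency counts are hypothetical, so the resulting values differ from the 85 and 90 reported above.

```python
from statistics import NormalDist


def exceeding_plus_half(freq, score):
    """Number of pupils exceeding `score` plus half of those reaching it."""
    above = sum(n for s, n in freq.items() if s > score)
    return above + freq.get(score, 0) / 2


def sd_value(proportion):
    """Stand-in for Table 23: scale value for a proportion exceeding-plus-half-reaching."""
    return 50 + 10 * NormalDist().inv_cdf(1 - proportion)


def extend_upward(freq, population, top_score, top_scale, new_scores):
    """Extend a scale above `top_score` using an older, partly tested age group.

    Counts are divided by `population` (all children of that age in the
    community), not by the number tested, as in step 29.
    """
    base = sd_value(exceeding_plus_half(freq, top_score) / population)
    extended = {top_score: top_scale}
    for s in new_scores:
        value = sd_value(exceeding_plus_half(freq, s) / population)
        extended[s] = top_scale + round(value - base)
    return extended


# Hypothetical distribution of 200 tested sixteen-year-olds out of 1000.
sixteen = {32: 120, 33: 40, 34: 20, 35: 20}
print(extend_upward(sixteen, 1000, top_score=33, top_scale=81, new_scores=[34, 35]))
```

Only differences between S.D. values are carried over to the twelve-year-old scale; the sixteen-year-old values themselves are not used directly.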

A special extension of a scale downward will rarely be 
necessary. If so it can be done by repeating steps 26, 27, 
28, 29, 30, and 31 for, say, eight-year-olds. In this case 
scale-score differences, according to eight-year-olds, would 
be subtracted from the lowest scale score, according to 
twelve-year-olds, instead of being added to the highest scale 
score as described above. 

How to Increase Accuracy of Scaling. — The only 
substantial defect of the T scale is the tendency toward un- 
reliability of the lower and upper scale scores. While not 
necessary it is advisable to correct for this defect by repeat- 
ing the process shown in Table 24 for eleven-year-olds and 
then for thirteen-year-olds. The average of these three 
scale scores for each total number of questions correct will
approximate what would have been secured had three times 
as many twelve-year-olds been tested. If it is feared that 
the thirteen-year-olds are not about as much above twelve- 
year-olds as eleven-year-olds are below twelve-year-olds, the 
scale scores as of eleven-year-olds and of thirteen-year-olds 
may each be equated with those for twelve-year-olds before 
the three scale scores are combined. The procedure for 
equating is extremely simple. For example, suppose that 
the scale score for 22 questions correct is 51 for the eleven- 
year-olds, 50 for the twelve-year-olds, and 49 for the thir-
teen-year-olds. By subtracting 1 from the scale scores for
eleven-year-olds and by adding 1 to the scale scores for
thirteen-year-olds both series will be made comparable to
the scale scores for twelve-year-olds. This is on the reason- 
able assumption that the variability for these three ages is 
approximately the same. If desired the amount to be added 
and subtracted can be determined more exactly by using 21,
23, 20, 24 and other numbers of questions correct to supple-
ment the determination for 22 questions correct.
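
A minimal sketch of equating and pooling, assuming each age's scale is held as a mapping from number of questions correct to scale score. The shift for each neighboring age is taken as the mean difference, over the raw scores the two series share, between the twelve-year-old scale score and that age's scale score. Function names and all values other than those for 22 questions correct are hypothetical.

```python
from statistics import mean


def equate_to(reference, other):
    """Shift `other` so that, on the average, it agrees with `reference`."""
    shared = reference.keys() & other.keys()
    shift = mean(reference[k] - other[k] for k in shared)
    return {k: v + shift for k, v in other.items()}


def pooled_scale(twelve, eleven, thirteen):
    """Average the twelve-year-old scale with the two equated neighbors."""
    e = equate_to(twelve, eleven)
    t = equate_to(twelve, thirteen)
    return {k: mean((twelve[k], e[k], t[k])) for k in twelve if k in e and k in t}


# Scale scores by number of questions correct; only the row for 22 is from the text.
eleven = {21: 49, 22: 51, 23: 53}
twelve = {21: 48, 22: 50, 23: 52}
thirteen = {21: 47, 22: 49, 23: 51}
print(pooled_scale(twelve, eleven, thirteen))
```

Using several raw scores to fix the shift, rather than 22 questions correct alone, is the refinement suggested in the last sentence above.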

Publication of Scale. — 

32. A leaflet was prepared for distribution with each 
package of test sheets purchased. This leaflet gave detailed 
directions as to how to apply and score tests, how to com- 
pute pupil and class scores, how to utilize age and grade 
norms and the like. All too frequently excellent scales are 
practically valueless because these essential details are given 
little attention or are totally overlooked. 

This completes the process and the scale is ready for use. 
The additional technique yet to be described, namely the use 
of a calibrator, applies only in case two or more duplicate 
scales are being constructed. 

Use of Calibrator in Scale Construction. — Since not 
one but several reading scales were made at once a calibra- 
tor, which is merely a few questions common to each pre- 
liminary and final test or group of tests, was used in con- 
structing the reading scales. While a calibrator is never a 
necessity it is of value when the number of different prelim-
inary and final tests becomes too great to apply them all to
the same pupils, or when, for some other reason, it becomes 
necessary to apply part of the tests to one group of pupils 
and part to another presumably equivalent group. 

If the distribution of the ability of one group is identical 
with that of the other group, a calibrator is superfluous. If, 
however, there is a difference between the two groups, the 
scale difficulty of each question may differ so much that 
S.D. values are not comparable from group to group and 
hence questions cannot conveniently be shifted at will from 
one preliminary test to another in arranging the final tests. 
Also the scale scores for each total number of questions and 
the age and grade standards in terms of these scale scores 
will not be exactly comparable from scale to scale. The 
calibrator questions, since they are answered by the pupils in 
each group, become a means of referring all S.D. values, 
whether for each individual question or for each total num- 
ber of questions correct, to one group or to the mean of the 
two groups. 

Suppose that 49 per cent of one group of twelve-year-olds 
answers a certain calibrator question correctly while 52 per 
cent of another group answers this same question correctly. 
According to Table 23 the question has an S.D. value of 50 
for the former group and an S.D. value of 49 for the latter 
group. The difference is 1 in favor of the latter group. Ev-
idently the latter group is, on the average, composed of abler
pupils. By averaging the S.D. differences between the 
groups on several calibrator questions an accurate measure 
may be secured of the mean difference in ability between the 
two groups. 

Let us suppose that the mean difference turns out to be 
2 in favor of the latter group. By adding 2 to the difficulty 
value of each question, according to the latter group, all 
values for the individual questions become comparable to 
those for the former group. Or by subtracting 2 from the 
difficulty values as determined by the former group these 
values become directly comparable to those for the latter.
Or by adding one-half of 2 to the values for the latter group 
and by subtracting one-half of 2 from the values of the 
former group the difficulty values of all questions are made 
what they would have been had all questions been answered 
by all pupils in both groups. In this way a scale construc- 
tor can get the benefit of a larger number of pupils without 
actually trying all of his material upon all the pupils. 
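
The three ways of applying the calibrator difference can be sketched in Python. The question labels, the S.D. values, and the function names below are hypothetical; the sketch assumes each group's difficulty values are held as a mapping from question to S.D. value.

```python
from statistics import mean


def calibrator_difference(former, latter, calibrator_ids):
    """Mean S.D. difference (former minus latter) on the shared questions."""
    return mean(former[q] - latter[q] for q in calibrator_ids)


def refer_to_former(latter, diff):
    """Make the latter group's values comparable to the former group's."""
    return {q: v + diff for q, v in latter.items()}


def refer_to_latter(former, diff):
    """Make the former group's values comparable to the latter group's."""
    return {q: v - diff for q, v in former.items()}


def refer_to_mean(former, latter, diff):
    """Refer both sets of values to the mean of the two groups."""
    merged = {q: v - diff / 2 for q, v in former.items()}
    merged.update({q: v + diff / 2 for q, v in latter.items() if q not in former})
    return merged


# Hypothetical S.D. values; c1 to c3 are the calibrator questions.
former = {"c1": 50, "c2": 56, "c3": 44, "a1": 61}
latter = {"c1": 48, "c2": 54, "c3": 42, "b1": 39}
diff = calibrator_difference(former, latter, ["c1", "c2", "c3"])
print(diff, refer_to_former(latter, diff)["b1"], refer_to_mean(former, latter, diff))
```

A single constant shift corrects only for a difference in mean ability between the groups; it does nothing about differences in the form of the two distributions.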

The same difference that is used to equate the scale values
of different questions may also be used in the same way to
equate, with reference to either group or the mean of both
groups, the scale scores for each total number of questions
correct.¹

¹ At the time of writing there is some doubt about the correctness of this state-
ment. It is possible that a calibrator may be used as recommended to equate scale
values of individual questions, but not scale scores. The problem is being studied
empirically.

The statements made in the two preceding paragraphs
are only roughly true. The difficulty value of any particu-
lar question or the scale score of any total number of ques-
tions correct is a resultant partly of the mean ability of the
group and partly of the form of the distribution of that abil-
ity. When one group of pupils is better as a whole or poorer
as a whole than another group the calibrator will smooth out
the difference. The calibrator concerns itself only with
average differences between groups.

After the scale scores on the final scales have, in this way,
been made mutually comparable, or even when no attempt
has been made to make them comparable, it is advisable to
average the age norms on all the scales and the grade norms
on all the scales to get more accurate age and grade norms.
When, however, comparability of values has been established
it is really necessary to establish age and grade norms for
only one of the scales.

Short Cuts in Scale Construction — Scaling Teachers'
Examinations. — The process of scale construction may
be shortened by the elimination of steps 3, 4, 5, 6, 7, 8, 9, 10,
11, and 12. It sometimes happens that the test questions
are selected for some diagnostic reason and that, as a con-
sequence, no questions need be eliminated because of scor-
ing and statistical considerations. Again, it sometimes hap-
pens that the arrangement of the questions should be based
upon considerations other than those of difficulty and that
this arrangement can be finally decided without a prelim-
inary try-out. In making the reading scale, for example, it
was impossible to so arrange the questions that each ques-
tion would be more difficult than the immediately preced-
ing question. Each question had to be grouped with the
prose or poetry selection upon which it was based. Finally,
there is adequate justification for the contention that test
elements need not advance with continuous increases in
difficulty and that only a rough arrangement in order of diffi-
culty is necessary.

Further, the process can be greatly simplified if the pur- 
pose is merely to construct a scale and not to determine age 
and grade norms. In this case all that is necessary is to call 
together in each school and test the twelve-year-old pupils. 
This reduces the process to five steps as follows: 

1. Prepare the test in its ultimate form.

2. Collect and test unselected twelve-year-olds.

3. Score the test sheets and compute the total number of 
test elements correct for each pupil. 

4. Compute the per cent of pupils exceeding plus half 
those reaching each total number of test elements correct. 

5. Convert these per cents into scale scores by the use of 
Table 23 and the scale is finished. 

This process is so brief and simple that teachers can use 
it advantageously for scaling their examinations, thus: 

1. Apply the examination to the pupils in the class. 

2. Score each question in the examination as either right
or wrong or in any other way that may seem advisable. 

3. Compute the total score for each pupil on the entire 
examination. 

4. Compute the per cent of pupils in the class exceeding 
plus half those reaching each score made.

5. Convert these per cents into scale scores by means of 
Table 23. 

6. Give each pupil his scale score instead of his original 
total number of points on the examination. 
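
The six steps above can be sketched in Python. The sketch assumes that Table 23 follows the normal curve with the mean at 50 and 10 points to the S.D., and uses `statistics.NormalDist` in place of the printed table; the function name and the raw scores are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def t_scale(raw_scores):
    """Map each raw score made in the class to a scale score.

    For every score made, take the per cent of pupils exceeding it plus half
    of those reaching it, and convert that per cent to an S.D. value with
    50 at the mean and 10 points to the S.D.
    """
    counts = Counter(raw_scores)
    n = len(raw_scores)
    table = {}
    for score in counts:
        exceeding = sum(c for s, c in counts.items() if s > score)
        proportion = (exceeding + counts[score] / 2) / n
        table[score] = round(50 + 10 * NormalDist().inv_cdf(1 - proportion))
    return table


# Hypothetical raw examination scores for a class of five pupils.
raw = [1, 2, 3, 4, 5]
table = t_scale(raw)
print(table)
print([table[s] for s in raw])  # step 6: each pupil's scale score
```

Counting half of those reaching a score keeps the proportion strictly between 0 and 1, so even the highest and lowest scores in the class receive a finite scale value.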

A teacher will find such a scale score the most convenient 
form in which to keep her record of the pupils. Such a 
score will make the pupil's record on any examination com- 
parable with his record on any other examination. Further- 
more, since the score is in statistical form it is possible for 
the teacher to combine records by simple addition. When 
records are kept in terms of A, B, C, D, E, statistical manip- 
ulation is impossible. 

II. Evaluation of Methods of Scale Construction 

Reference Point. — Whatever the measurement, scor-
ing must have some starting point, some reference point.
Kalamazoo has a location, but the location is not very intel- 
ligible to anyone unfamiliar with Kalamazoo unless given 
some reference point or points. If we say Kalamazoo is so 
many degrees west longitude and so many degrees north 
latitude, the reference points are the line of longitude pass- 
ing through Greenwich and the line of latitude corresponding 
to the equator. According to scientific measurement the 
reference point for measuring an individual's height is either 
the soles of the bare feet or the actual crown of the head. 
Whether the thing measured be distance, time, weight, 
courage, reading ability, or arithmetical skill, there must be 
a starting point for scoring. 

The following drama will illustrate the need for a com-
monly understood reference point:

TRAGI-COMEDY OF ERRORS 

ACT FIRST 

Railroad Station, Richmond, Va.
Enter Traveler, Native of Baltimore, Native of Savannah, Bostonian,
and Author of "HOW TO MEASURE IN EDUCATION."

Traveler: Is New York City farther than Philadelphia?
Author: Define your point of reference. (Exit Author.)
Native of Baltimore: Yes.
Native of Savannah: Yes.
Bostonian: No! (Exit Bostonian.)
Traveler: How much farther is New York City than Philadelphia?
Native of Baltimore: About twice.
Native of Savannah: About one-tenth!

The End

Scientists soon discovered that scientific progress was 
handicapped by the fact that different individuals were using 
different reference points when measuring temperature. 
Finally after long wasteful delays two competing reference 
points have been adopted, one which places the zero point 
of the temperature scale 32 degrees below the freezing point 
of water and one which locates it at the freezing point of 
water. In similar manner scientists agreed to make the zero 
for the height of land forms the sea level. They could have 
made it the center of the earth or the base of the Acropolis. 
In the measurement of many things in life then there is no 
one point divinely called to be zero. Convenient zero points 
have been proposed, debated, and arbitrarily adopted. 

Mental measurers have for years been searching for an
appropriate reference point or points. The tendency has 
been to search for some absolute zero point for the trait 
being measured. This has resulted in a different zero point 
for each scale made. If the process continues we shall have 
hundreds of zero points each of which is extremely nebulous, 
and no one of which is generally accepted. The resulting 
confusion would enormously handicap the development of 
mental measurement. 

We have had not only a different reference point for each 
test, but different methods of locating this point. First, the
reference point on unscaled tests is just no score on the
material of the particular test. Second, the reference point 
on certain scales is a zero point guessed at by the author of 
the scale. Third, the reference point on other scales, partic- 
ularly judgment scales, is the median judgment of judges as 
to the location of zero merit in composition. Fourth, the 
reference point on other scales is a zero point located by the
use of the per cent of pupils in some early grade who make 
no score on very easy material. Fifth, the reference point 
for other scales is 3 S.D. below the mean of the group for 
whom the test was devised. Sixth, the reference point on 
another scale is simply the lowest score made. Still other 
methods of locating reference points have been used. 

Since few mental measurers agree as to the best method of 
locating a zero point, since few agree as to just what the zero
for reading or any other mental trait is, since any such point
if actually found is bound to be relatively invisible and hence
more or less valueless as an aid to the proper interpretation
of scores, since prevailing methods of locating zero are certain
to produce as many different points as there are scales, and 
since this last must inevitably result in general confusion, 
this book proposes that a common reference point be arbi- 
trarily adopted for all tests which are to be used in the ele- 
mentary and perhaps high school. It is recommended that 
this reference point be not a zero point at all but instead the 
mean performance of children between the ages of 12.0 and 
13.0. Such a reference point could be used for any mental 
trait regardless of the location of its absolute zero, if such 
there be. 

The method of scale construction described in the preced- 
ing section makes the mean performance of twelve-year-olds 
the point of reference. Any pupil who makes a scale score 
of 50 has an ability equal to the mean ability of twelve-year- 
olds. According to Table 23 his ability is exceeded by 50 
per cent of twelve-year-olds. Any pupil who makes a scale 
score of 40 has an ability which is 10 units or 1 S.D. below
the mean ability of twelve-year-olds. According to Table 
23 he is exceeded by 84.13 per cent of twelve-year-olds. 
Any pupil whose scale score is 75 is 2.5 S.D. above the mean 
ability of twelve-year-olds. Only a little more than six- 
tenths of one per cent of twelve-year-olds exceed this record. 
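
The three readings just given can be checked with a short Python sketch. It assumes that Table 23 follows the normal curve and uses `statistics.NormalDist` as a stand-in for the table; the function names are illustrative.

```python
from statistics import NormalDist


def sd_units_from_mean(scale_score):
    """Distance from the mean of twelve-year-olds, in S.D. units (10 points each)."""
    return (scale_score - 50) / 10


def per_cent_exceeding(scale_score):
    """Per cent of twelve-year-olds whose ability exceeds the given scale score."""
    z = sd_units_from_mean(scale_score)
    return 100 * (1 - NormalDist().cdf(z))


for score in (50, 40, 75):
    print(score, sd_units_from_mean(score), round(per_cent_exceeding(score), 2))
```

The loop prints the distance from the mean and the per cent exceeding for scale scores of 50, 40, and 75, the three cases discussed above.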

Actually in a mathematical sense the zero of the scale is 5 
S.D. below the mean of twelve-year-olds. This is a case 
where the zero of the scale is not the actual reference point.

The mathematical zero was located at 5 S.D. below the
mean instead of at the mean for four reasons. First, this
procedure eliminated cumbersome plus and minus signs. 
Had the zero been placed at the mean it would have been 
necessary to report a pupil's score as −3 S.D., or +2.5
S.D., or the like. It is much more convenient to report a
pupil as 20 or 75, or the like. Second, this procedure gives a
convenient range of points between 1 and 100 and locates the
reference point at the easily remembered 50. Third, this
procedure carries the scale down as far and up as far as any-
one will need to go. This would not be the case if the scale
ranged only from −3 S.D. to +3 S.D. Fourth, this pro-
cedure gives a mathematical zero which is close to the sup-
posed absolute zeros for reading, writing, spelling, compo- 
sition, completion and other typical mental functions. An 
investigation of several scales which have been constructed 
by Buckingham, Trabue, Woody and others showed that the 
zero points selected by them were only slightly above 5 S.D. 
in terms of twelve-year-olds below the mean of twelve-year- 
olds, thereby compensating for the general feeling on the part 
of most of these authors of scales that their zero points were 
probably a bit high. In sum, 5 S.D. below the mean of 
twelve-year-olds is itself a reasonably good zero point for 
many commonly measured abilities and in addition serves 
to accentuate an even better reference point as well as 
serving other practical purposes. 

Which sort of reference point should be used depends in 
part upon the use to be made of the measurements. It is 
generally acknowledged that the best way to appreciate the 
ability of a pupil is to refer him to the mean ability of his 
own or some standard group. On the other hand, it is ac- 
knowledged that the best reference point to employ when the 
"times" statement is made is the absolute zero of the trait in
question. The absolute zero is desirable, for example, when 
it is desired to state that James has two times the ability of 
John. The proposed actual reference point meets almost 
perfectly the needs of the former group. The proposed
mathematical zero at 5 S.D. below the mean of twelve-year- 
olds also meets the needs of the latter group reasonably well, 
for 5 S.D. below the actual reference point is an approxi- 
mate absolute zero for most school traits. A slight conces- 
sion on the part of those who prefer an absolute zero will 
make possible a uniformity which, in my judgment, more 
than compensates for the loss entailed in making this con- 
cession. 

Another objection which may be urged against the mean 
of twelve-year-olds as a reference point is that any score 
above or below this point would not indicate whether the 
pupil possessed much or little of the trait being measured, 
for the mean twelve-year-old pupil possesses relatively little 
of certain traits and relatively much of certain other traits. 
The only cure for this defect is to use as the reference point 
the undiscoverable absolute zero of the trait and, in addition, 
to lose one of the values of the proposed reference point. 
If the absolute zero point is found and used no one score can 
always indicate the mean ability of twelve-year-olds. It is 
probable that it is easier for teachers to estimate the amount
of a trait possessed by a pupil from a knowledge of how 
much a mean twelve-year-old pupil possesses than from a 
knowledge of a score's distance above some theoretical abso- 
lute zero. Furthermore, the material of the test itself is the 
best common sense index of the amount of ability possessed 
by anyone taking it. 

The only other convenient reference point which has been 
proposed is the time of birth. Such a reference point is 
used in the Stanford Revision of the Binet-Simon Scale. In 
fact, its use has been rather general in the case of intelligence 
scales which are true scales. The chief objection to the 
universal adoption of this popular and easily understood 
reference point is that it practically compels the adoption 
of a measuring unit which, as will be shown later, is less sat- 
isfactory. 

Unit of Measurement. — Just as all measurement re- 
quires some reference point so all measurement requires
some unit. The reference point for a mountain is sea level. 
Its height above this reference point is expressed in terms 
of a certain measuring unit called a foot. The reference
point for measuring time is the birth of Christ, or January
1st, or 12 M., and the units are centuries, years, days, hours,
minutes, and seconds. 

The variety of reference points is almost equaled by the 
variety of units for mental measurement. The reader can 
get some idea of the situation by remembering his own 
"buzzing blooming confusion" upon being suddenly asked to 
learn the tables for meters, liters, grams, pounds, marks, 
shillings, yards, dollars, etc. This confusion multiplied ten- 
fold is a sample of the present chaos in mental measurement. 
Some employ simple scoring units, others employ as a unit 
some function of the variability of any one grade, others 
employ some function of the variability of all the grades 
treated separately and then combined through the determi- 
nation of inter-grade intervals, others employ some function 
of the variability of all grades combined, others employ the 
variability of judgment, others employ the variability of 
adult performance, others employ relative worth, and others 
employ still different units. This book proposes that a single 
common unit of measurement be adopted for all mental 
scales to be used in the elementary school, namely, some func-
tion, preferably S.D., of the variability of twelve-year-old
pupils. It is further proposed that a similar unit based 
upon, say, sixteen-year-old pupils, be used for tests designed 
especially for high schools. Tests designed especially for 
the primary grades could be scaled for, say, eight-year-olds. 

The highest point in the technique of mental scale con- 
struction has been reached by Thorndike and his students. 
They have used some function of variability as a unit and 
this is admirable. They have used the variability of a 
grade, which is not so admirable. Many forces are at work 
such as reorganizations of grade systems, improvements in 
classification, and the like, which are bound to profoundly 
alter, in a relatively short time, all scale values and the 
significance of the scale units employed. Any unit based upon 
such an artificial and ephemeral group as a grade lacks the 
necessary permanence. They have used values based upon 
the variability of several grades and have combined these 
values through an elaborate procedure of weighting values 
and determining inter-grade intervals. This procedure has 
the merit of giving temporarily reliable results, but the whole 
procedure is altogether too laborious for it to be generally 
used. Furthermore, the values when pooled for several 
grades with intricate weightings cease to be interpretable. 
The only sort of variability which has much meaning is the 
variability of some one defined group. Even so this group 
of scientific workers has done more to further the cause of 
accurate scale construction than any other group in the 
world. 

The other high point in scale construction began with 
Binet and Simon and culminated in the Stanford Revision 
of the Binet-Simon Scale by Terman. This line of develop- 
ment has been popular rather than technical. Its reference 
point has been the time of birth, and its unit of measure- 
ment has been one year of growth or some subdivision 
thereof. These are reference points and scoring units which 
all can understand. They utilize chronological age, one of 
the most abiding features of human life. 

There are, however, some very serious objections to this 
unit of measurement. While a permanent one, it is not 
equal in the truest sense of the word, at all points on the 
scale. A fact, now taken for granted, is that the interval 
between 8 and 9 years of age is larger than the interval 
between 14 and 15, in the case of intelligence and probably 
for many school traits as well. Furthermore, in the case of 
certain mental traits, the units become of zero size beyond 
about sixteen years of age. In abilities where a loss occurs, 
after instruction in the elementary school ceases, the age 
unit may be actually less than zero, i. e., negative. Finally, 
because of the late entrance into school of some pupils and 
because of the disappearance into the social medium of a 
goodly per cent of the graduates and over-compulsory 
school-age pupils of the elementary school, it becomes diffi- 
cult, if not impossible, to build up a scale below an age of 8 
years and above an age of 12 years. This means that such 
a restricted scale cannot satisfactorily score a very poor or 
very able pupil. To accurately extend the scale so it will 
measure these individuals requires that a test be previously 
scaled beyond these points by some other method of scaling. 
Lending itself as it does to easy interpretation and to the 
ready computation of quotients, the age unit is deservedly 
popular. It is, however, most serviceable in the realm of 
intelligence measurements where, presumably, retrogressions 
do not normally occur until senility sets in. Its defects 
are almost fatal for universal mental measurement. Even 
so it is at least the second best unit for universal adoption. 

The unit proposed for universal adoption by this book, 
namely, one-tenth S.D. of twelve-year-old children, has long 
been used by careful mental measurers to compare pupils 
with other pupils in their own age group. This unit is 
equal at all points on the difficulty scale, which is the chief 
characteristic of the unit employed by Thorndike and his 
students. It is based upon chronological age which is the 
chief characteristic of the work of Terman and his 
predecessors. It is a function of the variability of a defined group 
and a group which is easily located. A scale which uses this 
unit reaches as low and as high as the ordinary requirements 
of practical measurement. Special extension at the top or 
bottom is a simple process. And not of least importance is 
the fact that the construction of the scale which employs 
this unit is not particularly laborious. The subsequent 
improvement of scale values is simpler still. In sum the 
proposed unit combines most of the virtues and eliminates 
most of the defects of the two chief contemporary methods 
of constructing mental scales. In a certain sense it unites 
the two great lines of scale development. Since the two 
greatest contemporary exponents of these merging methods 
are Thorndike for the one and Terman for the other, it is a 
tribute to their genius to call the proposed unit, namely one- 
tenth S.D. of unselected twelve-year-old children, a Thorn- 
dike-Terman, or, for brevity, a T. 
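
A short sketch may make the proposed unit concrete. It assumes that each raw total score is located by the proportion of unselected twelve-year-olds falling below it (counting half of those who make exactly that score), and that this proportion is turned into an S.D. distance under a normal distribution. Placing the twelve-year-old mean at 50 T is an assumption of the sketch and is not stated in this section; the function name and the sample scores are invented for illustration.

```python
from collections import Counter
from statistics import NormalDist


def t_scale_table(twelve_year_old_scores):
    """Return {raw score: T value} for a reference group of raw scores.

    Assumptions of this sketch: the trait is normally distributed in the
    reference group, and the group mean is placed at 50 T (1 T = 0.1 S.D.).
    """
    n = len(twelve_year_old_scores)
    counts = Counter(twelve_year_old_scores)
    unit_normal = NormalDist()  # mean 0, S.D. 1
    table = {}
    below = 0
    for raw in sorted(counts):
        # Proportion below this score, plus half of those making it.
        proportion = (below + 0.5 * counts[raw]) / n
        sd_distance = unit_normal.inv_cdf(proportion)
        table[raw] = round(50 + 10 * sd_distance)
        below += counts[raw]
    return table


# Invented scores for nine twelve-year-olds on a five-question test.
reference = [1, 2, 2, 3, 3, 3, 4, 4, 5]
table = t_scale_table(reference)
print(table)
```

A pupil of any age would then read his T value from this one table, which is what permits comparison across age groups.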

This variability-of-an-age-group unit has long been used 
among psychologists. But they have used the unit for pur- 
poses of comparing a pupil with other pupils in his own age 
group. Table 23 will facilitate such use. The greatest 
need, however, is to compare, in some quantitative fashion, 
a pupil in one age group with another pupil in another age 
group or the mean of one age group with the mean of 
another age group. This need is met by employing the T 
for twelve-year-olds as has been done in this chapter. 

The proposed unit and method of scale construction is 
applicable in almost if not all situations where mental traits 
are being measured. As has been seen no obstacle has ap- 
peared when the object was to make a scale to measure how 
much difficulty a pupil could achieve. Instead of difficulty 
the basis could just as well have been relative worth or fre- 
quency of occurrence or something else. Any sort of a basis 
which somehow yields varying scores for pupils may be used. 

Even product scales may be transmuted into T scales, 
thereby making all scales performance scales. As has been 
pointed out product scales such as those of composition and 
handwriting are really nothing more than scoring instru- 
ments. Answers to reading questions may be scored as 
either right or wrong, but a whole reading test could not 
conveniently be scored right or wrong or even 2, 1, or 0 all 
at once. A composition test is scored all over all at once. 
Those doing the scoring required some scaled scoring instru- 
ment. Product scales are such scoring instruments. After 
the scoring an additional step is possible and may be desir- 
able, namely, to take the scores of twelve-year-olds based 
upon product scales and scale them as shown in Table 24. 
The T scale should be used for all performance tests. The 
method used by Hillegas should be used in making all 
product scales. In order that comparisons between 
performance on all tests may be easily made, product-scale 
scores should be converted into T-scale scores. 

Method of Combining Scoring Units. — Most tests 
and scales, excepting such scales as Monroe's Standardized 
Reasoning Test in Arithmetic and Henmon's¹ French tests, 
compute a pupil's score by getting a simple total of the units 
satisfactorily done. A pupil's score on Courtis' test in ad- 
dition is the total number of examples covered. His score 
on Woody's test in addition is likewise the total number of 
examples done satisfactorily, which, at the same time, indi- 
cates the total number of P.E. difficulties surmounted. His 
score on Thorndike's test of visual vocabulary is a measure 
of how far up a scale of difficulty he can work, i. e., his score 
is a simple total of the P.E. distances covered. His score 
on the Nassau Composition Scale or Ayres' Handwriting 
Scale is a simple total of the units of merit his own merit 
exceeds. 

Among these scales which combine units by simple total 
there are at least three distinct ways of doing it. There is, 
first, the product-scale method. The scorer slides the pupil's 
specimen of handwriting, say, up the scale until its proper 
location on the scale is determined. The pupil certainly has 
an ability beyond that of any lower position on the scale. 

But how is it with, say, the Woody test in addition or the 
Trabue Language Completion Scales? Does an ability to do 
an example with a P.E. difficulty of 4 guarantee an ability 
to do any other example with a P.E. difficulty less than 4? 
Not at all. It frequently happens that a pupil succeeds with 
an example whose difficulty is greater than that of examples 
on which he failed. The case might occur of one pupil who 
does the three easiest examples and misses all the rest, and 
of another pupil who does the three most difficult examples 
and misses all the easier ones. According to this second 
method both pupils would make the same score. This 
sort of thing frequently happens in a less drastic fashion. 
However infrequent such events, these gaps are not exactly 
chasms of music to students of the theory of measurement. 

¹ V. A. C. Henmon, "Standardized Vocabulary and Sentence Tests in French"; 
Journal of Educational Research, February, 1921. 

There is, third, the 20%-error method of Thorndike's 
and Haggerty's reading scales, and the 50%-error method 
of Woody's Fundamentals of Arithmetic Scales, Series A. 
These scales locate a pupil's or a class' score at that point 
on the scale where 20% of error is made or 50% of error is 
made, and to a certain extent wink at what has happened 
below this point and above this point. 
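
The idea of locating a score at the point where a fixed per cent of error is made can be sketched as follows. The sketch assumes that the per cent correct has already been tabulated at each scale value, and that the crossing point is found by straight-line interpolation between adjacent scale values. Neither the interpolation rule nor the sample figures come from Thorndike, Haggerty, or Woody, and the function name is hypothetical.

```python
def error_point(scale_values, pct_correct, pct_error=20.0):
    """Scale value at which the per cent of error reaches pct_error.

    scale_values: difficulties in ascending order.
    pct_correct:  per cent correct at each difficulty (falling as difficulty rises).
    Returns None if the tabulated range never crosses the target.
    """
    target = 100.0 - pct_error  # per cent correct at the crossing point
    points = list(zip(scale_values, pct_correct))
    for (d0, p0), (d1, p1) in zip(points, points[1:]):
        if p0 >= target >= p1 and p0 != p1:
            # Straight-line interpolation between the two adjacent elements.
            return d0 + (p0 - target) * (d1 - d0) / (p0 - p1)
    return None


# Invented class tabulation: per cent correct at four difficulties.
difficulties = [1, 2, 3, 4]
correct = [95, 85, 70, 40]
score_20 = error_point(difficulties, correct, pct_error=20)
score_50 = error_point(difficulties, correct, pct_error=50)
print(score_20, score_50)
```

Only the two elements on either side of the crossing point affect the result, which is the sense in which the method winks at what happens elsewhere on the scale.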

Of these three, the method of the product scale is the only 
thoroughly satisfactory one. Whether we attribute it to the 
accidents of chance or to the "iron laws" of nature, a 
performance scale cannot be so constructed that pupils will 
conveniently work every test element in order of its diffi- 
culty and then suddenly break off. The difficulties of test 
elements, as determined by testing a large number of pupils, 
may not exactly fit any individual child. 

Of the last two methods mentioned, the one used by 
Woody and Trabue, namely, the number of examples done 
correctly, or the number of sentences correctly completed, 
is a far more convenient method than the 20%-error method. 
Why, then, was the latter used at all? In the first place, the 
method used by Woody and Trabue requires a scale made 
up of equal steps. Woody used the 50%-error method for 
his Series A scales because they do not progress by even 
steps. The 20%-error or 50%-error method is of such a 
nature that a scale of unequal steps does no harm. In 
Table 26 are two scales, the first with, and the second with- 
out, equal steps. The "Am't to be added" tells how much a 
pupil's score should be increased when he does the test ele- 
ment correctly. 

TABLE 26 

A Comparison of Two Scales, One of Which Progresses by Equal 
Steps and One Which Progresses by Unequal Steps 

                                                          Maximum 
Elements                 A    B    C    D    E    F    G    Score 

First Scale 
P.E. value above zero    1    2    3    4    5    6    7 
Am't to be added         1    1    1    1    1    1    1      7 

Second Scale 
P.E. value above zero    1    2    3    3.1  4.5  5    6 
Am't to be added         1    1    1    0.1  1.4  0.5  1      6 

If, in the first scale, a pupil misses test element C he fails 
to get the one point increase; if he gets C correct his score is 
increased by one point. To lose or gain anywhere means a 
loss or gain of just one point. In the second scale a loss 
or a gain would be one point if he missed or worked A, B, C 
or G, or a loss or a gain of one-tenth or one-and-four-tenths 
or five-tenths if he missed or worked respectively D, E, or F. 
To some it does not seem reasonable to penalize a pupil 
1.4 for missing E and only 0.1 for missing D. In other 
words test element E has fourteen times as much influence 
upon a pupil's score as test element D, although the 
casual observer would not remark any special superiority in 
element E. Again, when the steps of the scale are very 
irregular, the resulting scores may cause an injustice in 
comparing one pupil with another. Here are the scale 
increments and total score earned by two pupils: 

                                                       Score 
Pupil a    1    1    0.1  0.2  0.1  0.3  0.2  0.0       2.9 
Pupil b    1    0    0.0  0.0  0.0  0.0  0.0  0.0       1.0 

Pupil a appears to be nearly three times as able as pupil b. 
As a matter of fact their ability may easily be closer together 
than the scores indicate. Pupil b could not quite do the 
second test element. Had there been several test elements 
with difficulties between 1 and 2 just as there are between 2 
and 3, pupil b might have made a score of 1.9. In sum be- 
cause of the irregular steps pupil a was specially favored. 
The favoritism is, of course, not serious, but it definitely 
exists. The percentage-of-error method, especially an elab- 
oration of it used by Kelley, takes care of this irregularity. 
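
The arithmetic of Table 26 can be checked with a brief sketch. It treats the amount to be added for each element as the step from the preceding P.E. value, and it scores the two extreme pupils described earlier: one who works only the three easiest elements and one who works only the three hardest. The function names are illustrative and not taken from any of the scales discussed.

```python
def increments(pe_values):
    """Amount added for each element: its step above the element before it."""
    steps = []
    previous = 0.0
    for value in pe_values:
        steps.append(round(value - previous, 10))  # rounding hides float noise
        previous = value
    return steps


def simple_total(steps, passed):
    """Sum the increments of the elements the pupil did correctly."""
    return round(sum(step for step, ok in zip(steps, passed) if ok), 10)


first_scale = [1, 2, 3, 4, 5, 6, 7]         # equal steps
second_scale = [1, 2, 3, 3.1, 4.5, 5, 6]    # unequal steps

easiest_three = [True, True, True, False, False, False, False]
hardest_three = [False, False, False, False, True, True, True]

for name, scale in (("first", first_scale), ("second", second_scale)):
    steps = increments(scale)
    print(name, steps,
          simple_total(steps, easiest_three),
          simple_total(steps, hardest_three))
```

On the equal-step scale the two pupils tie at 3 points. On the unequal-step scale the same three successes are worth 3 points at the bottom and 2.9 at the top, so the score depends on where the irregular steps happen to fall.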

Monroe's Standardized Reasoning Test in Arithmetic may 
be selected as a sample of those tests which combine scoring 
units according to the method of cumulative total. Table 
27 shows the difference between the methods of simple total 
and cumulative total. 

TABLE 27 

A Comparison of the Method of Simple Total and Cumulative Total 
for Combining Scoring Units 

                                                                Maximum 
Test Element                           A    B    C    D    E    F    Score 

P.E. value above zero                  1    2    3    4    5    6 
Am't to be added (simple total)        1    1    1    1    1    1      6 
Am't to be added (cumulative total)    1    2    3    4    5    6     21 

The difference between these two combining methods can 
be shown again by the following analogy — the computation 
of a man's height. As the illustration shows, the cumula- 
tive total is a rather Brobdingnagian one. 

                                                                   Total 
                   Sole   Ankle   Knee     Hip      Shoulder  Crown    Height 

Scale value        0      3 in.   22 in.   41 in.   61 in.    74 in. 
Simple total       0      3       19       19       20        13       74 in. 
Cumulative total   0      3       22       41       61        74       201 in. 

Another illustration does not present the cumulative total 
in such an unfavorable light. In this illustration a boy has 
agreed to carry some wood and is to be paid by the number 
of pounds of wood he carries. Here the cumulative total 
appears the natural one to use. 

                   Stick I   Stick II   Stick III   Stick IV   Total 

Scale value        10 lbs.   15 lbs.    20 lbs.     35 lbs. 
Simple total       10        5          5           15         35 
Cumulative total   10        15         20          35         80 

Thus the two methods measure different things. The 
simple total measures the height of the ability while the 
cumulative total measures the number of work units com- 
pleted. The simple total measures how heavy a log a child 
can carry. The cumulative total measures the weight of the 
logs he does carry. 
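
The two combining methods can be set side by side in a few lines. The sketch below reproduces the totals of Table 27 and of the height and wood illustrations; the function names are chosen here for illustration.

```python
def simple_total_from_values(scale_values):
    """Add only the step from each scale value to the next (starting at zero)."""
    total = 0
    previous = 0
    for value in scale_values:
        total += value - previous
        previous = value
    return total


def cumulative_total(scale_values):
    """Add the full scale value of every element completed."""
    return sum(scale_values)


table_27 = [1, 2, 3, 4, 5, 6]            # P.E. values of elements A to F
height = [0, 3, 22, 41, 61, 74]          # inches: sole, ankle, knee, hip, shoulder, crown
wood = [10, 15, 20, 35]                  # pounds: sticks I to IV

for name, values in (("Table 27", table_27), ("height", height), ("wood", wood)):
    print(name, simple_total_from_values(values), cumulative_total(values))
```

When every element up to a point is completed, the simple total is merely the highest scale value reached, while the cumulative total grows with every element below it as well.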

Each method is best for its purpose, but the cumulative 
total does not appear to be specially advantageous in 
educational measurement. Practically the world thinks of ability 
as yielded by the simple total. In comparing the ability of 
two persons the customary mental act is similar to a com- 
parison of their heights. Properly to interpret differences 
between scores computed by the cumulative method, educa- 
tors must change their customary mental set. They must 
remember that if one pupil has a score of 4 and another has 
a score of 10, the second pupil has done two and one-half 
times as many work units as the first pupil. It does not mean 
that the second pupil is two and one-half times as able as 
the first one. If this point of interpretation is kept in mind 
it makes little difference which method is used, for while 
scores computed by the two methods are not proportional 
they do correlate closely, especially if the test in each 
instance measures the maximum ability of the pupils. 

The use of the simple or the cumulative total for measur- 
ing how difficult material a pupil can work should not be 
confused with the problem of measuring speed. In most 
speed tests every test element requires approximately equal 
time for the pupil to complete it. Hence the units are in 
that sense equal and a pupil's speed is determined by the 
number of test elements completed or tried per unit of time. 
But occasionally the convenient work units in a speed test 
require quite varying times. In such instances it is neces- 
sary to determine the relative amount of time required for 
each work unit, i. e., to give each work unit a time value. 
It is quite legitimate to add up the rate values of all the test 
elements completed by the pupil in order to get a measure 
of his speed of work. But it should be observed that the 
rate value on each element does not reach back to the time 
of the starting of the entire test. 

In sum, we have seen that, with the exception of product 
scales, no scales have a totally satisfactory method of 
combining scoring units to make a pupil's score. The 80-20 
per cent method devised by Thorndike is accurate enough 
for the computation of class scores but its use for this pur- 
pose requires a laborious tabulation by test elements. It is 
unsuited to the computation of individual pupil scores. An 
elaboration of the 80-20 per cent method by Kelley is 
sufficiently accurate for the computation of pupil scores but 
it is too elaborate for use except for research purposes. Few 
non-technically trained individuals can understand or use 
the method. To overcome these handicaps, Trabue and 
Woody constructed scales the adjoining elements of which 
were equally far apart in difficulty. This permitted the 
computation of pupil scores by the simple process of ad- 
dition. But to secure such a scale meant that most of the 
original test material had to be thrown away as useless. 
Monroe employed an easy method of combining units which 
did not require equal steps of difficulty. But as we have 
seen the method of the simple total appears preferable to 
his method of the cumulative total. 

Thus the T-scale method was developed not only to pro- 
vide a more satisfactory reference point and unit of measure- 
ment, but also to provide a method of combining scoring 
units which yields a genuine scale score for each pupil, which 
combines units by the method of simple total, which pre- 
serves all the original test material, and which is simple 
enough to be used by non-statistically trained educators. 
All these objects were attained at one stroke by scaling the 
total score.¹ Previous methods of scale construction have 
been to scale the test elements and then provide an additional 
technique for computing a pupil's score from his success in 
dealing with these test elements. The proposed method 
makes this second step unnecessary. Scaling the total num- 
ber of questions correct or, when more than one point is 
given for each question, the total number of points made 
shows immediately the scale score corresponding to each 
total number of points, which in turn is secured by merely 
adding the points made on the different test elements. 

Combining scoring units by scaling the total score has still 
another advantage. The preparation of duplicate forms of 
a test is an extremely tedious process when conventional 
methods of scale construction are employed. For scales 
which secure a pupil's score by totalling the number of test 
elements correctly done, it is necessary to have each test ele- 
ment in one scale exactly matched by a test element in the 
duplicate scale of equivalent difficulty. Such an exact 
equivalence is not required in T scales. One scale may 
serve as a duplicate for another even where there is a con- 
siderable difference in difficulty between them. 

¹ I have just learned that Pintner is, by a somewhat different process, scaling the 
total score for his new tests. He is constructing a separate scale for each age 
rather than a scale which cuts across the different ages as the T scale does. 



CHAPTER XI 

DETERMINATION OF RELIABILITY, OBJECTIVITY, 
AND NORMS 

I. Reliability 

Sources of Unreliability. — By reliability of a test is 
meant the amount of agreement between results secured 
from two or more applications of a test to the same pupils 
by the same examiner. Perfect reliability obtains when 
an identical examiner applies two identical or exactly dupli- 
cate tests according to an identical procedure to identical 
pupils. This last sentence indicates in brief those attributes 
which are essential to high reliability in a test, and the 
absence of which makes for unreliability. 

The first source of unreliability in a test is variation in the 
behavior of the examiner produced by causes external to the 
test itself. There are a host of causes which have the power 
to produce large or subtle changes in the personality and be- 
havior of the examiner, which behavior may in turn operate 
to raise or lower the pupil's scores. Such possible causes are 
an obstreperous pupil, a welcoming smile from the teacher, 
an indigestible lunch, etc. Chance might produce an 
especially favorable combination of causes at the first testing 
and an unfavorable combination at the second. Such a situ- 
ation would tend toward differences in results and hence 
toward unreliability. This cause of unreliability is not an 
attribute of the test itself. 

A second source of unreliability in a test is variation in the 
behavior of the examiner produced by causes inherent in the 
test itself. These causes may be in the instructions for the 
test, the method of scoring or the statistical treatment of 

307 



3o8 How to Measure in Education 

results. Perhaps the most important of these causes is inad- 
equate description. Ideally the author's description should 
reveal exactly how the examiner is to deal with every sig- 
nificant situation which may arise in the process of testing, 
scoring, etc. When an author begins the description of how 
to administer his test, in this fashion: "See to it that all 
pupils understand what is expected of them," there is offered 
an opportunity for wide variation between different adminis- 
trations of the test. Instructions are a part of the test and 
should be just about as definite and uniform as the test 
material itself. Definite instructions to the examiner as to 
how to score with uniform rigor and how statistically to treat 
results are no less important. A study of the extent to which 
Binet Test examiners have found it necessary to carry 
standardization of procedure will give a good idea of the 
importance of this point. 

A third source of unreliability in a test is the never-ceasing 
moment-to-moment variation in pupils themselves. Like 
the examiner, each pupil is at any one moment influenced by 
a multitude of minute forces which pulse and play like mir- 
rored lights on moving water. An automobile horn, the 
lonesome howl of Jack's dog, the bleating of Mary's lamb, a 
sudden thought of the swimming hole, growing discomfort of 
a strained posture, these and a thousand other large and 
small internal and external influences register themselves in 
the pupils' scores. It is rare for the registration to be equal 
at two test periods, and as a consequence, results from two 
tests differ. It is this difference which makes the test unre- 
liable, for there is often no reason to believe that the pupils' 
reactions at one test period are more typical than at another. 

It is not, however, always fair to judge a test's reliabil- 
ity by the absolute similarity between the two scores for each 
pupil. There are certain constant causes which operate to 
produce absolute differences in results and hence make a 
test's reliability appear less than its real reliability. These 
constant forces must be eliminated or allowed for before the 
real reliability can be determined. Such constant causes 
are improvement due to experience with the first test and 
due to normal growth in the measured trait. For pupils 
insist upon changing with increased age and increased ex- 
perience. Every second leaves its ever so little deposit. 
Goaded by this distracting refusal of pupils to remain 
stationary, Ayres has suggested that chloroforming experimental 
pupils would be a great convenience! 

How may these constant causes be eliminated? Four 
methods have been employed: the methods of optimum 
interval, duplicate test, experimental allowance, and 
self-correlation. The first three methods aim to reveal the 
absolute similarity between the two scores for each pupil; the 
last method only permits a relative comparison. The opti- 
mum interval method is to allow just that interval between 
the first and the second test which brings the ability of the 
pupils most nearly to their ability at the time of the first 
test. A zero interval is impossible except for determining 
the reliability of supervisory observations and the like. A 
familiar law of nature forbids the application of two tests to 
the same pupils at the same time. Besides it is desirable 
that the two tests be given at different times to discover 
whether any pupil's score is influenced by a temporary head- 
ache or other cause of an "off day." The longer the interval 
the more any practice effect disappears. The interval must 
not be too long or the decrease in the trait due to forgetting 
will be counterbalanced by an increase due to greater 
maturity. 

In choosing the optimum interval many factors should 
be taken into consideration. The increase due to maturity 
takes place less rapidly for most traits than the decrease due 
to forgetting. Again, some tests are of such a nature that 
one pupil cannot communicate to another the ability to do 
the test successfully, nor does any pupil retain after a brief 
interval any effective memory of how to do the test. By 
the proper juggling of these factors a pupil's ability in 
many tests may be practically returned to his original 
ability. 

The duplicate test method aids the optimum interval 
method. The use of a duplicate test the second time par- 
tially avoids in the case of rate tests, and almost completely 
avoids, in the case of the difficulty tests, any increase in 
score due to practice effect. 

The experimental-allowance method is to determine ex- 
perimentally, by using a comparable group, and allow for the 
influence of all these constant causes for a given interval of 
time. 

The fourth and most convenient method of all for elimi- 
nating these constant errors is that of self-correlation. It 
may be used alone or may be aided by both the optimum 
interval and duplicate test methods. The method of self- 
correlation is to compute the coefficient of correlation be- 
tween the two series of scores secured from two adminis- 
trations of the same or duplicate tests to the same pupils. 
If this correlation is zero, the test has no reliability what- 
soever and the test is worthless no matter how many other 
good qualities it may possess. The nearer the coefficient 
approaches unity the nearer the test approaches perfect 
reliability. A later chapter shows how to compute this 
self-correlation. The change in pupil scores due to practice 
effect or normal growth does not deflect correlation from 
unity toward zero provided the influence of these factors is 
equal for each pupil, which is substantially the case after 
any reasonable interval. 
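
The claim that an equal gain for every pupil leaves the self-correlation untouched can be verified directly. The sketch below assumes that the coefficient meant is the ordinary product-moment coefficient of correlation; the pupil scores are invented, and the function name is chosen here for illustration.

```python
from math import sqrt


def self_correlation(first, second):
    """Product-moment correlation between two series of scores for the same pupils."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    covariation = sum((x - mean_1) * (y - mean_2) for x, y in zip(first, second))
    variation_1 = sum((x - mean_1) ** 2 for x in first)
    variation_2 = sum((y - mean_2) ** 2 for y in second)
    return covariation / sqrt(variation_1 * variation_2)


first_testing = [12, 15, 9, 20, 17]

# Every pupil gains 3 points through practice or growth.
equal_gain = [score + 3 for score in first_testing]

# Pupils gain by different amounts, so their relative order is disturbed.
unequal_gain = [13, 19, 15, 20, 16]

r_equal = self_correlation(first_testing, equal_gain)
r_unequal = self_correlation(first_testing, unequal_gain)
print(round(r_equal, 4), round(r_unequal, 4))
```

The equal-gain series correlates perfectly with the first testing although no pupil repeats his score, whereas the unequal gains pull the coefficient below unity.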

Methods of Increasing Reliability. — How high should 
the reliability of a test be? The answer is: the higher the 
better. If the self-correlation coefficient is zero, the test is 
worthless; if the coefficient is unity, the test reliability is 
perfect. Here are the reliability coefficients for five stand- 
ard educational tests: .55, .7, .75, .8 and .9. All uses of 
test results are based upon pupil scores, and a class score 
which is usually a mean or median of the pupil scores. A 
class score for a class of ordinary size will be sufficiently 
reliable for most purposes even though the test's self-corre- 
lation is as low as .55. But if the test scores are to be used 



Determination of Reliability, Objectivity, and Norms 311 

to make important judgments concerning individual pupils, 
the self-correlation should be above .9. Scores for 
individual pupils have some value even when the self-correlation 
falls considerably below .9. A test whose self-correlation 
is anywhere above zero is better than nothing at all for 
measuring individual pupils. 

How may a test's reliability be increased if it falls below 
what is required for the purposes of the investigation? 
Suggestions have already been made as to how to decrease 
variation in the examiner and hence increase reliability 
through a better standardization of test procedure. An 
additional source of unreliability is the variation in pupils 
due to the operation of chance causes other than those con- 
tributed by the examiner. 

There are three ways in which these chance causes of 
unreliability may be overcome: first, by increasing the 
length of the test; second, by averaging results from repe- 
titions of the test or the test and its duplicates, and third, 
by a combination of the first and second methods. Unfor- 
tunately there is a limit to the number of times an identical 
test may be repeated owing to its increasing familiarity to 
the pupil, and this limit varies for different kinds of tests. 
In case a high reliability is desired, the existence of dupli- 
cate tests may therefore become an important factor in 
determining a test's worth. Duplicate tests are equally 
useful in preventing coaching. Just how many times a test 
or duplicate tests must be administered to secure a desired 
reliability, or just what would be the reliability of any given 
number of applications of duplicate tests may be quickly 
determined by the use of Spearman's self-correlation 
formula, whose solution is explained in a later chapter. 
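
Pending that later chapter, the computation can be sketched here on the assumption that the formula meant is the one now commonly known as the Spearman-Brown formula, r_n = n r / (1 + (n - 1) r), where r is the self-correlation of a single application and n is the number of applications averaged. The function names are illustrative only.

```python
def reliability_of_repetitions(r_single, n):
    """Reliability of the average of n applications (Spearman-Brown form)."""
    return n * r_single / (1 + (n - 1) * r_single)


def repetitions_needed(r_single, r_desired):
    """Number of applications needed to reach r_desired (the same formula solved for n)."""
    return r_desired * (1 - r_single) / (r_single * (1 - r_desired))


# A test whose single-application self-correlation is .55.
doubled = reliability_of_repetitions(0.55, 2)
needed_for_90 = repetitions_needed(0.55, 0.90)
print(round(doubled, 3), round(needed_for_90, 2))
```

Under this assumption a test with a self-correlation of .55 rises to about .71 when two applications are averaged, and would need between seven and eight applications to reach .9.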

II. Objectivity 

Importance of Objectivity. — A test is perfectly 
reliable when identical results are secured from two 
applications of a test to the same pupils by the same examiner. 
A test is perfectly objective when identical results are 
secured from two applications of the same test to the same 
pupils by different examiners. A test is perfectly subjective 
when no two examiners agree. Ordinarily the objectivity 
of a test is lower than its reliability due to the addition of a 
new cause of variation, namely, the difference in the per- 
sonal equation of the different examiners. Some tests are 
more objective than others. A test of an individual's tem- 
perature, pulse, blood-pressure, finger-length, head-circum- 
ference and the like, is usually much more objective than a 
test of his handsomeness or charm. Estimation of a man's 
height is rather subjective. The use of measuring instru- 
ments here as well as in education tends to increase objec- 
tivity. Tests are not totally subjective or totally objective. 
Objectivity, like reliability, is a matter of degree. Tests 
occupy points on a subjective-objective continuum with 
perhaps none located at either extreme. The degree of 
agreement by different examiners is the measure of a test's 
location on this subjective-objective scale. 

Objectivity is an extremely important consideration in 
the construction of tests. So important is it that there is 
little exaggeration in stating that this criterion of objec- 
tivity is the mother of scientific educational measurement. 
For educational tests are an outgrowth of the extreme dis- 
satisfaction with the subjectivity of previous methods of 
measuring the educational output. Progress in all sciences 
has been attended by a decrease in the personal equation 
through improvements in measuring instruments. Verifica- 
tion is the greatest word in the language of science. Educa- 
tion has been and still is to a large extent saturated with the 
personal equation. All progress in the development of 
education as a science is closely linked up with the creation 
of measuring instruments or measuring methods whose 
application yields verifiable results. 

How to Determine and Increase Objectivity. — A
test's objectivity may be determined by the solution of the 
following formula: 




Objectivity = reliability - personal equation.

Objectivity is, however, more easily computed directly. A 
direct determination is made in the same way reliability 
is determined, namely, by comparing the absolute similarity 
of the two examiners' results, or by computing the coeffi- 
cient of correlation between the two resulting series of 
pupil scores. The same precautions against the same con- 
stant errors must be taken in determining objectivity as in 
determining reliability. Hence the method of self-correla- 
tion will prove as convenient for computing objectivity as 
reliability. 
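
The direct determination described above may be sketched in Python as follows. This is a minimal illustration only: the function name and the scores of the two examiners are hypothetical, and the product-moment (Pearson) coefficient is assumed to be the coefficient of correlation intended.

```python
from math import sqrt


def correlation(xs, ys):
    """Product-moment correlation between two equally long score series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_xx = sum(dx * dx for dx in dev_x)
    sum_yy = sum(dy * dy for dy in dev_y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("a series with no variation has no correlation")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical scores given to the same five pupils by two examiners.
examiner_a = [12, 15, 9, 20, 17]
examiner_b = [11, 16, 10, 19, 18]

# Objectivity expressed as a coefficient of correlation.
objectivity = correlation(examiner_a, examiner_b)

# Objectivity expressed as absolute similarity: mean difference per pupil.
mean_abs_diff = sum(abs(a - b) for a, b in zip(examiner_a, examiner_b)) / len(examiner_a)
```

The two figures answer slightly different questions. The coefficient shows whether the examiners rank the pupils alike, while the mean difference shows how far apart their actual scores lie.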

How may a test's objectivity be increased? The problem 
in education is no whit different from the problem in other 
sciences. The first step in its solution is to do everything 
possible toward increasing the reliability of the test accord- 
ing to the methods sketched in the previous section. The 
second step is to determine, wherever possible, the amount 
and direction of the personal equation of the different ex- 
aminers, and to allow for them. For some time to come 
improvement in reliability will be the most convenient and 
promising method of improving objectivity. 

As with reliability, objectivity can be increased by a care- 
ful standardization of the entire testing process. If two 
examiners apply the test in different ways, disagreement is 
assured. If the method of scoring leaves room for the 
exercise of much judgment, disagreement is almost certain 
to arise. If there is a variation in the statistical method of 
computing pupil or class scores, it is hardly reasonable to 
expect results to agree. Adequate description can avoid 
most of the variation due to test application and statistical 
treatment. Much ingenuity is now being applied to develop- 
ing completely objective means of scoring pupils. 

III. Norms 

Factors Influencing the Worth of Norms. — There are 
in use two kinds of norms or standards which need to be
distinguished, namely, standards of achievement and stand-
ards for achievement. The former means actual average
achievements of age or grade groups, whereas the latter 
refers to goals or objectives for these ages or grades. At the 
last meeting of the National Association of Directors of 
Educational Research it was recommended that the former 
be called norms and the latter be called standards. When 
the objectives have been accepted by some authoritative 
national organization they may be called standards. If the 
objectives represent the opinion of the author of the test, 
they should be designated as author's standards. 

Norms are more valuable when they are representative 
of the group with whom it is most desirable to make com- 
parisons. If but one norm could be had, all would agree 
that this should be the norm for all pupils in the country. 
To secure this norm does not require us to test every child 
in the country, but it does require us to select the pupils to 
be tested in such a way that our sampling will show the 
correct proportion of each level of ability. There would
be more pupils tested in schools with an average social 
environment than in schools with extremely poor and ex- 
tremely good environments, because the average type of 
school is more numerous than either of the others. Simp-
son,¹ who desired nothing more than a satisfactory average
standard for adults, tested some unusually intelligent indi- 
viduals and some equally unintelligent individuals. The 
average of the scores made by these two groups will give 
a fairly accurate central tendency for the total group of 
which these two groups represent the extremes. If we wish 
norms for the first percentile, second percentile, tenth per-
centile, twentieth percentile and so on to the 100th percen-
tile, Simpson's method is inadequate. We need to test
pupils selected so as to represent the proper proportion of 
each ability level. If ten per cent of the total pupil popula- 
tion are at a certain ability level, then ten per cent of the 

¹ B. R. Simpson, Correlations of Some Mental Abilities; Bureau of Publication,
Teachers College, Columbia University, N. Y. C., 1912.




pupils tested should be at this same ability level, and simi- 
larly for other levels. 
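
The rule of proportional representation stated above can be sketched in Python. The function name, the three school types, and the pupil counts are hypothetical and serve only to show the arithmetic; rounding by largest remainder is an assumption made so that the quotas add up exactly to the sample size.

```python
def proportional_quotas(population_counts, sample_size):
    """Split sample_size among groups in proportion to their population counts."""
    total = sum(population_counts.values())
    if total <= 0:
        raise ValueError("population must contain at least one pupil")
    exact = {group: sample_size * count / total
             for group, count in population_counts.items()}
    quotas = {group: int(share) for group, share in exact.items()}
    # Hand out the places lost to rounding, largest fractional part first.
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for group in by_remainder[:leftover]:
        quotas[group] += 1
    return quotas


# Hypothetical enrollment by type of social environment.
enrollment = {"poor": 2000, "average": 6000, "good": 2000}
quotas = proportional_quotas(enrollment, 500)
```

With these figures the average type of school supplies three-fifths of the pupils tested, since it holds three-fifths of the pupil population.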

Norms are more valuable when they are stable. The 
stability of a norm is a function of the satisfactoriness of 
the sampling and the number of cases. What constitutes a 
satisfactory sampling was discussed in the preceding para- 
graph. Given this, or as nearly this as is humanly possible, 
the greater the number of cases the greater the stability of 
the norm. This statement is not exactly true, for, as Pintner
and Paterson ² point out, the perpetual multiplication of
cases is not necessary to secure a stable norm. There comes 
a point where the addition of new cases does not materially 
influence the previous determination. What we really want 
answered is: how many cases are necessary to establish a 
reasonably stable norm? Until we have collected more ex- 
perience, there is, as Pintner and Paterson state, no way to 
tell except by averaging the scores of varying numbers of 
cases and by watching the resulting fluctuations in the aver- 
ages. When the addition of, say, 100 new cases does not 
materially alter the previously determined norm, the norm 
has stabilized. 
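
The procedure of watching the average as cases accumulate may be sketched in Python as below. The function names, the batch size, and the tolerance are assumptions introduced for illustration; what counts as a "material" change in the norm is left by the text to the judgment of the worker.

```python
def running_norms(scores, batch=100):
    """Average of the first batch, 2 * batch, 3 * batch ... scores collected."""
    norms = []
    total = 0.0
    for count, score in enumerate(scores, start=1):
        total += score
        if count % batch == 0:
            norms.append(total / count)
    return norms


def cases_needed(scores, batch=100, tolerance=0.1):
    """Number of cases after which one more batch moves the average by no
    more than tolerance; None if the average never settles."""
    norms = running_norms(scores, batch)
    for k in range(1, len(norms)):
        if abs(norms[k] - norms[k - 1]) <= tolerance:
            return k * batch
    return None


# Hypothetical scores in order of collection, with a small batch for brevity.
scores = [10, 12, 18, 20, 15, 15, 15, 15]
norms = running_norms(scores, batch=2)
settled_at = cases_needed(scores, batch=2)
```

A single quiet step may be an accident of sampling, so in practice it would be prudent to require that several successive batches leave the average undisturbed.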

The number of cases required to stabilize a norm varies 
with the type of norm being stabilized. There are in com- 
mon use three kinds of norms: (a) the average performance 
norm, (b) the placement-of-a-test-at-a-specific-age-on-an-
average-scale norm, and (c) percentile or variability norms.
The placement of a test element in a grade scale is akin to 
the establishment of percentile norms. While there is a 
slight variation in practice, the usual method of placing a 
test on such an age scale as the Binet-Simon Intelligence 
Scale and the Pintner-Paterson Performance Scale is to place 
a test at that age level where 25% fail it and 75% get it 
correct. These per cents were selected on the principle that 
the 25% poorest should fail the test and the 50% normal 
and 25% supernormal should pass it. Now since more of 

² Rudolf Pintner and Donald Paterson, A Scale of Performance Tests; D. Apple-
ton & Co., N. Y., 1917.




the pupils taken in the sampling will fall near the median 
than near the 75 percentile, and since more will fall at the 
75 percentile than at any percentile still nearer the extreme 
of the total group, more pupils are required to secure stable 
percentile norms than to place a test on an age scale, while 
the average norm is stabilized by a relatively small number 
of cases. 
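
The placement of a test on an age scale can be illustrated with a short Python sketch. The pass rates are hypothetical, and reading the rule as "the youngest age at which at least 75% pass" is an assumption, since practice varies slightly as noted above.

```python
def place_test(pass_rates, target=0.75):
    """Youngest age at which the proportion passing reaches the target."""
    for age in sorted(pass_rates):
        if pass_rates[age] >= target:
            return age
    return None  # no age group tested reaches the target


# Hypothetical proportion of each age group passing one test.
pass_rates = {6: 0.20, 7: 0.48, 8: 0.77, 9: 0.93}
age_level = place_test(pass_rates)
```

Because the decision turns on a proportion near the 75 percentile, each age group must be large enough for that proportion to be trustworthy, which is the point of the paragraph above.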

Test norms are more useful when the method of their 
derivation is clearly described. This appears self-evident, 
yet it is not at all uncommon to find a statement of norms 
without any explanation as to the method of their derivation, 
whether, for example, they are mean norms or median 
norms, or 75 percentile norms, or some other kind. 

Test norms are more useful when they are reported in 
full. One author reports as norms for his test the highest 
score ever made in the test by any one class in the grade. 
This may have been done to stimulate teachers to special 
effort to bring their class up to this high standard. But such 
a stimulation is as liable to be unwholesome as beneficial 
since it may lead to overemphasis upon one subject. It is 
well to give the highest score and lowest score, or better 
the upper quartile and lower quartile scores, or, better still, 
all the percentile scores, for the fuller the norms are stated, 
the better. But whether several norms are given or only 
one norm, the best single score to report is some average 
measure. 

Test norms are more useful when they are both universal 
and local. An average norm for wide areas is useful but 
so are separate norms for a great many typical locations. 
Williams in North Carolina, Foote in Louisiana, Jordan in 
Arkansas, and other workers in southern states found that 
the spread of educational measurement in the South was 
handicapped by lack of norms from localities comparable 
with rural schools in southern states. They have begun the 
work of securing norms for the more important educational 
tests. While universality and locality of standardization are 
important considerations, it should not be forgotten that ex-
cellent norms will not make an otherwise poor test into a
good one. 

Finally, norms are more valuable when reported for both 
age and grade. Ballard, in his recent book on mental tests, 
complains that most of the norms developed in America are 
useless in England because our norms are grade norms. Age 
norms would be almost as valuable in England as in America. 
Again, age norms permit the computation of reading age and 
Reading Quotient, spelling age and Spelling Quotient, and
mental age and Intelligence Quotient. Numerous instances 
of the important functions which such measures serve have 
been illustrated many times throughout this book. 
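
How age norms lead to a quotient may be sketched in Python. The sketch assumes the customary ratio definition, namely the test age divided by the chronological age and multiplied by 100; the function name and the ages are hypothetical.

```python
def quotient(test_age_months, chronological_age_months):
    """Ratio quotient: 100 * test age / chronological age (assumed definition)."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * test_age_months / chronological_age_months


# A hypothetical pupil aged 8 years (96 months) whose score equals the
# age norm for 10 years (120 months).
intelligence_quotient = quotient(120, 96)
```

The same function serves for a reading or spelling quotient when a reading age or spelling age is supplied in place of the mental age.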

SUPPLEMENTARY READING FOR PART II 

Buckingham, B. R. — Spelling Ability; Its Measurement
and Distribution; Bureau of Publication, Teachers Col-
lege, New York, 1913.

Courtis, Stuart A. — The Gary Public Schools: Measure-
ment of Classroom Products; General Education
Board, New York, 1919.

Dewey, Evelyn; Child, Emily; and Ruml, Beardsley. —
Methods and Results of Testing School Children; E. P.
Dutton & Company, New York, 1920.

Hillegas, Milo B. — Scale for the Measurement of Quality
in English Composition by Young People; Teachers
College, Columbia University, New York, 1912.

Pintner, Rudolf. — The Mental Survey; D. Appleton &
Company, New York, 1918.

Pintner, Rudolf, and Paterson, Donald. — A Scale of
Performance Tests; D. Appleton & Company, New
York, 1917.

Rugg, Harold O. — Application of Statistical Methods to
Education; Houghton Mifflin Company, New York,
1916.

Terman, Lewis M. — The Measurement of Intelligence;
Houghton Mifflin Company, New York, 1916.

Thorndike, Edward L. — Introduction to the Theory of
Mental and Social Measurements; Teachers College,
Columbia University, New York, 1913.

Trabue, M. R. — Completion Test Language Scales; Teach-
ers College, Columbia University, New York, 1915.

Van Wagenen, M. J. — Historical Information and Judg-
ment of Elementary School Pupils; Teachers College,
Columbia University, New York, 1919.

Woody, Clifford. — Measurements of Some Achievements
in Arithmetic; Teachers College, Columbia University,
New York, 1916.

Yerkes, R. M.; Bridges, J. W.; and Hardwick, Rose S. —
A Point Scale for Measuring Mental Ability; Warwick
& York, Baltimore, 1915.

Yoakum, Clarence S., and Yerkes, R. M. — Army Mental
Tests; Henry Holt & Company, New York, 1920.



PART THREE 

TABULAR, GRAPHIC, AND STATISTICAL 
METHODS 

CHAPTER XII. TABULAR METHODS.

CHAPTER XIII. GRAPHIC METHODS.

CHAPTER XIV. STATISTICAL METHODS — MASS MEASURES.

CHAPTER XV. STATISTICAL METHODS — POINT MEASURES.

CHAPTER XVI. STATISTICAL METHODS — VARIABILITY MEASURES.

CHAPTER XVII. STATISTICAL METHODS — RELATIONSHIP MEASURES.



CHAPTER XII 

TABULAR METHODS 

Types of Tabulation. — Numerous practical and scien- 
tific uses of test scores require that they be preserved in 
convenient tabular form. The types of tabulation are too 
numerous to describe in detail. The more commonly used 
varieties are listed below. 

1. A tabulation showing one pupil and one test which 
is non-cumulative. Thus: 

Name            Test Score

Adams, Frank        26

2. A cumulative tabulation showing one pupil and one 
test. Thus: 



[Figure: Individual Record Card, Arithmetic. The handwritten sample entries are not reproduced; the legible field labels are Name, Boy or Girl, Birthday, and School.]

Fig. 4. Shows How to Fill Out the Individual Record Card for One of Courtis'
Standard Supervisory Tests.

3. A tabulation showing one pupil and one test which is 
tabulated by test elements. Thus: 

Name               Test Elements         Test Score

                a  b  c  d  e  f  g
Adams, Frank    1  1  1  0  0  x  x          3

4. A tabulation showing one pupil and many tests. 
Thus: 

Name            Test I   Test II   Test III   Test IV

Adams, Frank       8        72        41         20

5. A cumulative tabulation showing one pupil and many 
tests. All are familiar with the cumulative record card for 
teachers' marks for an individual pupil. A cumulative 
record card for test scores would be similar. 

6. A group tabulation showing one test and revealing the 
identity of each pupil. The addition of several names and 
scores to No. 1 above would convert it into such a tabula-
tion.

7. A group tabulation showing one test tabulated by test 
elements and revealing the identity of each pupil. The
addition of several names and scores to No. 3 would convert 
it into such a tabulation. 

8. A group tabulation showing several tests and reveal- 
ing the identity of each pupil. Thus: 

Pupil   Test I   Test II   Test III   Test IV   Test V   Test VI

a         13        32        18         42        60       5.0
b         13        35        20         50        72       6.0
c         10        30        12         30        60       4.5

When the tests are very numerous the tabulations could 
be made in a quadrilled blank book where the edges of the 
leaves have been so cut away as to make it unnecessary to 
rewrite the list of names on each page. 

9. A group tabulation showing one test and concealing 
the identity of each pupil. Many of the tabulation forms 
sent out with tests provide for the tabulation of the pupil's
score into a frequency distribution (see later chapter)
where the pupil's name does not appear.

10. A group tabulation showing several tests tabulated 
into frequency distributions without concealing the identity 
of each pupil. Test I and Test II under No. 8 above, which 
were tabulated according to what Rugg calls the writing 
method, are retabulated below according to the checking 
method. 



Test I

Pupil      8   9  10  11  12  13  14  15

a                              x
b                              x
c                  x

Total      0   0   1   0   0   2   0   0

Test II

Pupil   24-26  26-28  28-30  30-32  32-34  34-36  36-38  38-40  40-42

a                                     x
b                                            x
c                              x

Total     0      0      0      1      1      1      0      0      0


When pupils are numerous statistical computation is facili- 
tated by throwing scores into a frequency distribution. The 
above method of tabulation yields almost instantaneously a 
frequency distribution as shown in the Total row. This
method of tabulation is very advantageous in extensive 
studies and only in extensive studies. 

11. Individual or group tabulation by machinery. Many
large school systems have found it economical to install 
electrically driven machines for tabulating and sorting data. 
The Hollerith Tabulating Machine can punch on a small 
card an unbelievably large amount of data under a great 
variety of heads. The Hollerith Sorting Machine will in 
a brief time sort a large number of these cards according 
to any desired item. The machine will, in an additional 
step, count the cards in any item-group. The Powers 
machine will tabulate data upon cards in a similar fashion. 
It has an advantage over the Hollerith machine in that it
will sort and count at the same time, thus eliminating one 
step in the process. Since these mechanical devices can 
only be rented at a considerable charge they are not appro- 
priate for use on small studies. But where much data is to 
be handled they are a great economy indeed. 
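
Two of the tabulations listed above, tabulation by test elements (No. 3) and the condensation of scores into a frequency distribution (Nos. 9 and 10), may be sketched in Python. The helper names and the second pupil are hypothetical, and the reading of "x" as an element not attempted is an assumption, since the text does not define it.

```python
from collections import Counter

# Marks on each test element: 1 = right, 0 = wrong, "x" = not attempted
# (the meaning of "x" is assumed). The second pupil is invented.
element_marks = {
    "Adams, Frank": [1, 1, 1, 0, 0, "x", "x"],
    "Brown, Mary": [1, 1, 0, 1, 1, 1, "x"],
}


def score_from_elements(marks):
    """Test score as the number of elements marked right."""
    return sum(1 for mark in marks if mark == 1)


def frequency_distribution(scores, low, high):
    """Count of pupils at every score value from low to high inclusive."""
    counts = Counter(scores)
    return {value: counts.get(value, 0) for value in range(low, high + 1)}


element_scores = {name: score_from_elements(marks)
                  for name, marks in element_marks.items()}

# Test I scores of pupils a, b, and c from tabulation No. 8.
test_1 = {"a": 13, "b": 13, "c": 10}
distribution = frequency_distribution(test_1.values(), 8, 15)
```

The dictionary keyed by pupil plays the part of the writing method, and the counts by score value play the part of the checking method's total line.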




Selection of Tabulation Form. — Which tabulation 
form to select depends upon the purposes which the data 
are to serve and the persons for whom they are intended. 
No form will serve all purposes. The following questions 
will help to prevent overlooking some vital point in select- 
ing the tabulation form or forms: 

1. Will the form make it easy to identify the pupil and 
his score or scores? 

2. Will the form permit the addition of scores from year 
to year in order that teachers and scientific students of 
education may study the progress of pupils? 

3. Will the form permit the addition of scores from 
pupils tested late? 

4. Does the test require tabulation by test elements be- 
fore individual or class scores can be computed? 

5. Will tabulation by test elements make the data more
useful to the teacher?

6. Will the form economize space and fit existing files? 

7. Will the form give a bird's-eye view of large sections
of the data at once? 

8. Will the form require names to be written more than 
once? 

9. Will the form easily condense into a frequency dis- 
tribution? 

10. Will the form make it easy to lose data or parts of 
data? 

11. Will the form readily reveal that a portion of data 
is missing? 

12. Will the form, in terms of both tabulation and sub- 
sequent uses, economize time? 

13. Will the form make it easy to overlook data by the 
sticking together of cards, or by a misplacement in files? 

14. Will the form make filing and refiling in alphabeti- 
cal order very laborious? 

15. Will the form require much mechanical work in the 
examination of the data? 




16. Will the form permit computation on the original 
tabulation sheet? 

17. Will the form permit the tabulation of sub-totals, 
grand totals, and other summaries on the original tabulation 
sheet? 

18. Will the form facilitate the rearrangement of data 
in all desired orders without recopying? 

19. Will the form make it possible to separate any part 
of the data from other parts so that different individuals 
may be working upon the data at one time? 

20. Will the form stand the necessary wear and tear? 

21. Will the spaces on the form fit the requirements of 
the test or tests now and in the future? 

22. Will identification data such as name, sex, birthday, 
etc., have to be written more than once? 

23. Will the form permit the recording of interpretative 
scores such as quotients? 

Construction and Placement of Tables. — Experience 
has gradually crystallized into a series of conventions con- 
cerning how tables should be constructed. These conven- 
tions rest upon no particular authority, and are in fact fre- 
quently violated either from ignorance of their existence or 
to save space or for some other temporarily valid reason. A
few of these conventions are illustrated in Table 28 and are
summarized in the following principles:

1. The table number or letter and the title are placed 
above the table. Later it will be noted that the situation is 
just the reverse for diagrams. 

2. The title is sufficiently clear and complete to make it 
unnecessary to peruse the accompanying text in order to be 
able to understand the table. Along with a table an author 
usually has a rather full interpretation of it. All of this 
cannot go into the caption for the table. The title would be 
too bulky. It is necessary to choose for this caption the 
most essential features in the light of the most probable uses 
of the tabular data, and to state this in as concise and yet 



326 



How to Measure in Education 

TABLE 28 



Median Scores Made May, 191 7, on Woody 's Four Fundamentals of 
Arithmetic Scales, Series B, Compared With Woody's Norms 



School 
and 
Test 



N. Y. C. School 

Addition 

Subtraction . . 
Multiplication 
Division 

June Norm 
Addition .... 
Subtraction . . 
Multiplication 
Division 

N. Y. C. School 
Total 

June Norm 
Total 



Grade and Section 



III 



12.0 
8.0 

7-5 
2.1 



29.6 



B 



12.6 
9.1 
8.2 
6.0 



9.0 
6.0 

3-5 
30 



35-9 
21.5 



IV 



13-3 

lO.O 

94 
7.2 



39-9 



B 



14-5 
10.7 

11.3 
8.0 



II.O 

8.0 
7.0 
5-0 



44-S 
31.0 



B 



14.0 

lO.O 
II.O 

7.0 



42.0 



VI 



17.5 
13.7 
16.0 

II-3 



S8.5 



B 



18.1 
14.2 
17.1 

12.4 



16.0 

12.0 
iS-o 

lO.O 



61.8 



53 -o 



clear form as possible. The following question tests the 
excellence of a caption: Could the table be read and under- 
stood if it were completely removed from the text? 

This does not mean that accompanying text should be 
eliminated as useless. Alexander ¹ is right when he insists
that statistical material needs to be translated. He offers 
the following seven suggestions for good translations and 
profusely illustrates each. 

a. The illustrations and images used must be of an 
elementary nature, or at least familiar to the people for 
whom the translation is being made. 

b. In some cases it may be necessary to use several illus- 
trations in order to be sure of reaching all classes of people. 

c. Instead of representing a total by imagining an unreal 
extension of a familiar object, or by making up from 



¹ Carter Alexander, School Statistics and Publicity; Silver, Burdett & Company,
1919, 332 pp.




familiar units an aggregate so large as to be incomprehen- 
sible, it is usually better to employ some other unit. Often 
this other unit is one of time. 

d. In cost statistics, it is sometimes advisable to mini- 
mize the total by expressing it in amount per small unit of 
time, usually a trivial sum. 

e. Absolute accuracy frequently has to be sacrificed to 
force and clearness in translation. 

f. Practically all totals have to be translated through 
comparisons, using familiar objects or notions, before they 
can be understood or have much force for the average man. 

g. Many questions involving value, and particularly ex- 
hibits of loss or waste, can be profitably translated into a 
money equivalent. This is particularly true of all proposals 
involving an increase in school taxes, which must, of course, 
be addressed to the taxpayer. 

3. The descriptive items, such as the name of the school, 
the tests, the grades and the sections are placed at the left 
and top. In the preceding table the grades and sections were 
placed at the top and other descriptive items at the left. 
The grades could have been placed at the left and other 
descriptive items at the top, and under certain circumstances 
this would be preferable. In this case, however, such an 
arrangement would be less satisfactory. As the table now 
stands every descriptive item can be read from the bottom. 
Were the other descriptive items placed at the top they 
would extend beyond the limits of the page or they would 
have to be printed vertically, which makes reading difficult. 

4. Descriptive items are so placed that they may be read 
from the bottom of the page or bottom and right of the page. 
When possible, as it was in the above table, all items should 
be so placed that they may be read from one point, namely, 
the bottom of the page. The requirements of space some- 
times make it necessary to print the items at the top verti- 
cally. When this is done they should be made to read from 
the right of the page, while those placed at the left should 
be made to read from the bottom. 

5. Items and data are appropriately grouped and the
grouping is clearly indicated by indentation, spaces, or lines.
Thus "Addition," "Subtraction," etc., in the above table are
placed under "N. Y. C. School" and indented. The data for 
"N. Y. C. School" is set off from "June Norm" by a hori- 
zontal space. Sections A and B are placed under the grades 
and their subordinate nature is still further indicated by 
means of lines. 

6. Sub-items are placed under super-items and are more 
indented. Thus "Addition," "Subtraction," etc., are placed
under and to the right of "N. Y. C. School." 

7. The table reads downward and to the right. This is 
in part the reverse of diagrams which read upward and to 
the right. 

8. There are enough lines and rows of dots to guide the 
eye in reading the table. Lines and dots not only guide the 
eye but give the entire table a neater and more artistic ap- 
pearance. More horizontal guide lines for the above table 
would be a dubious improvement. Too many lines are as 
bad as too few. 

9. Important lines are made either double or extra
heavy.

10. When a table has long columns of figures each group 
of about five figures is separated from the adjoining group 
by a space. This practice is advisable even though the table 
is not intended for publication. 

11. The print is sufficiently large to prevent eye-strain. 
This principle is particularly pertinent in connection with 
summary tables placed in the body of the text where they 
will be frequently used. Tables of original data placed in 
the appendix may legitimately have a finer print. 

12. A gap in the consecutive units is usually indicated 
by a break in the data. The above table shows very clearly 
that Grade V was not tested. This prevents one from draw- 
ing the erroneous conclusion from a hasty inspection of the 
table that there is a sudden spurt in pupils' ability in the 
fundamentals of arithmetic. 

13. Condensation of a column of figures into a total or
average is clearly marked off by a line or space. The reason
for this is so evident as to require no explanation.

14. All decimal points in each column are kept in line. 
This appears such a truism as not to require mentioning. 
The statistician is rare, however, who has not spent consid- 
erable time recopying tables tabulated by presumably com- 
petent adults in such a way that not only were decimal 
points out of alignment but the whole column wound and 
twisted down even a quadrilled paper. 

15. When circumstances permit the data are more effec- 
tive when arranged in an order distribution. The first 
column of Table 29 (heavy print not in the original) illus- 
trates this principle. 

TABLE 29

Shows the Median Sizes of Classes in a Number of High Schools
by Subjects Together with the Range of the Middle Fifty Per
Cent (After Bobbitt ²)

Subject                   Median        "Zone of Safety"
                          No. Pupils        (Pupils)

Music                        58              42-88
Physical Training            32              28-55
English                      22              20-24
Mathematics                  21              18-24
History                      21              17-23

Science                      20              16-22
Agriculture                  19              1[?]-25
Commercial                   19              15-23
Drawing                      18              14-24
Modern Languages             17              15-20

Latin                        17              14-19
Household Occupations        17              13-23
Normal Training              15              10-21
Shopwork                     14              12-18

[In the printed table the English row is set in heavy type. One digit of the Agriculture range is illegible in this copy.]






16. Important items are made prominent by the use of 
heavy type. Let us suppose that Bobbitt constructed the 
above table to show the number of pupils in English classes 

² J. F. Bobbitt, "High School Costs"; School Review, 23: 505-534.




as compared with the number in other classes. The heavy 
type focuses attention upon English and the order distribu- 
tion shows that it holds third position among subjects. 

17. The table is placed as near as possible to the text 
which interprets it. Other things being equal, this is the 
case. Sometimes the table is placed at the beginning of 
the descriptive text, sometimes at the end. Table 28 was 
placed at the beginning of these descriptive principles so 
the reader would have in mind a concrete illustration of 
most of them. To present a fact and then explain it is 
psychologically easier to follow than to keep the reader in 
the dark until the culminating moment of the explanatory 
process has arrived. Sometimes, when the descriptive text 
is not dependent for meaning or concreteness upon the 
table, or when the descriptive text must precede to give any 
meaning to the table, the table is placed at the end or in 
the midst of the descriptive text. Sometimes, however, 
tables are and should be placed in the appendix only. Long 
tables of original data such as are found in Ph.D. theses 
would, if placed in the body of the book, tend to terrify the 
most hardened reader of statistical literature. They rarely 
illuminate the meaning of the text, and even when necessary 
for this purpose, a sampling is usually sufficient. Summary 
tables, which are really meant to be studied, should, how- 
ever, be imbedded in the text. 

The above principle is a protest against the tendency of 
some authors to put all their tables at one place, which is 
frequently far removed from the place where these tables 
are discussed. 

Additional suggestions for the construction of tables may 
be gleaned from the principles for graphic presentation 
which follow. 
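
Several of the foregoing principles can be applied mechanically, as the following Python sketch suggests. It is an illustration under stated assumptions rather than a standard routine: the function name and its parameters are hypothetical, and only four conventions are covered, namely the number and title above the table (No. 1), a space after every group of about five rows (No. 10), decimal points kept in line (No. 14), and arrangement in an order distribution (No. 15).

```python
def format_table(number, title, headers, rows, sort_column=None, group=5):
    """Return a plain-text table laid out by a few of the conventions above."""
    if sort_column is not None:
        # Order distribution: largest value first.
        rows = sorted(rows, key=lambda row: row[sort_column], reverse=True)
    # One decimal place for every decimal figure, so that right-justified
    # cells keep their decimal points in a straight line.
    cells = [[f"{v:.1f}" if isinstance(v, float) else str(v) for v in row]
             for row in rows]
    widths = [max([len(h)] + [len(row[i]) for row in cells])
              for i, h in enumerate(headers)]

    def line(values):
        first = values[0].ljust(widths[0])
        rest = [v.rjust(w) for v, w in zip(values[1:], widths[1:])]
        return "  ".join([first] + rest).rstrip()

    out = [f"TABLE {number}", title, "", line(headers)]
    for i, row in enumerate(cells):
        if i and i % group == 0:
            out.append("")  # a space after each group of rows
        out.append(line(row))
    return "\n".join(out)


# Three rows taken from Table 29, deliberately out of order.
example = format_table(
    29,
    "Median Sizes of Classes by Subjects",
    ["Subject", "Median No. Pupils"],
    [("English", 22), ("Music", 58), ("Latin", 17)],
    sort_column=1,
)
```

Printing `example` shows the title above the body and the subjects in descending order of class size. Guide lines, heavy type, and the other conventions are left to the printer or to a fuller program.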



CHAPTER XIII 
GRAPHIC METHODS 

Importance of Presentation. — Other chapters may 
exceed this one in length, but none exceeds it in the impor- 
tance of the topic considered. Recently posters appeared 
on New York City bill boards announcing a new play: "It 
Pays to Advertise." The poster showed a cackling hen 
leaving an egg-filled nest. For the sake of the public it is 
necessary to have a dignified title for this chapter. But it 
will not be amiss to imbed here in the privacy of the text 
the statement that the real title of this chapter is: It Pays 
to Advertise. Preceding chapters have attempted to show 
how the truth about conditions in the school may be dis- 
covered. Presumably these facts have not been collected to 
fill up files, but rather to publish in the schoolroom, at 
teachers' meetings, in public addresses, in school reports, or 
in periodicals. Presumably these facts have been collected 
to influence action — the action of pupils, teachers, super- 
visors, principals, superintendents, boards of education, or 
the public. Truth does not prevail through facts but
through the effective presentation of facts. 

There are three types of presentation in common use: the 
tabular, the graphic, and the linguistic. Generally speaking, 
that type of presentation is most significant which in the 
particular situation best fits the data, the purpose, the occa- 
sion, the medium of presentation, whether in an address, a 
published article, etc., and which best fits the kind of 
audience. 

The graphic method is, however, generally conceded to 
be the best method for most situations. The graphic method 
is particularly effective because when graphs are properly
made they are more easily and more quickly interpreted.
For both these reasons, and perhaps others in addition, 
graphs have an intrinsic psychological appeal denied to 
numbers and words. It is only the unusual person whose 
tabular or literary skill is sufficient to overcome this inherent 
superiority of the graphic method. Finally, the properly 
constructed graph shows not only the graph but presents 
tabular data and utilizes linguistic description at the same 
time. The graph combines most of the advantages of all 
three methods, and is hence a powerful instrument in the 
hands of intelligent educators. 

Standard Graphic Methods. — The standardization of 
graphic methods is just as important as the standardization 
of statistical procedure. In order to further a notable 
movement toward standardization which has already begun 
and in order to give the reader an introduction to graphic 
methods the full preliminary report of the Joint Committee 
on Standards for Graphic Presentation is given below. 

Joint Committee on Standards for Graphic Presentation 

Preliminary Report Published for the Purpose of
Inviting Suggestions for the Benefit of
the Committee

As a result of invitations extended by The American 
Society of Mechanical Engineers, a number of associations 
of national scope have appointed representatives on a Joint 
Committee on Standards for Graphic Presentation. Below 
are the names of the members of the committee and of the 
associations which have cooperated in its formation. 

Willard C. Brinton, Chairman, American Society of Mechanical Engineers,
    7 East 42d Street, New York City.
Leonard P. Ayres, Secretary, American Statistical Association,
    130 East 22d Street, New York City.
N. A. Carle, American Institute of Electrical Engineers.
Robert E. Chaddock, American Association for the Advancement of Science.
Frederick A. Cleveland, American Academy of Political and Social Science.
H. E. Crampton, American Genetic Association.
Walter S. Gifford, American Economic Association.
J. Arthur Harris, American Society of Naturalists.
H. E. Hawkes, American Mathematical Society.
Joseph A. Hill, United States Census Bureau.
Henry D. Hubbard, United States Bureau of Standards.
Robert H. Montgomery, American Association of Public Accountants.
Henry H. Norris, Society for the Promotion of Engineering Education.
Alexander Smith, American Chemical Society.
Judd Stewart, American Institute of Mining Engineers.
Wendell M. Strong, Actuarial Society of America.
Edward L. Thorndike, American Psychological Association.

The committee is making a study of the methods used in 
different fields of endeavor for presenting statistical and 
quantitative data in graphic form. As civilization advances 
there is being brought to the attention of the average indi- 
vidual a constantly increasing volume of comparative figures 
and general data of a scientific, technical and statistical 
nature. The graphic method permits the presentation of 
such figures and data with a great saving of time and also 
with more clearness than would otherwise be obtained. If 
simple and convenient standards can be found and made 
generally known, there will be possible a more universal use 
of graphic methods with a consequent gain to mankind be- 
cause of the greater speed and accuracy with which complex 
information may be imparted and interpreted.






THE FOLLOWING ARE SUGGESTIONS WHICH THE COMMITTEE 
HAS THUS FAR CONSIDERED AS REPRESENTING THE MORE 
GENERALLY APPLICABLE PRINCIPLES OF ELEMENTARY 
GRAPHIC PRESENTATION 

1. The general arrangement of a diagram should proceed from
left to right.

[Fig. 5. Curve diagram: Population (0 to 100,000,000) by Year.]



[Fig. 6. Bar diagram of tons by year.]

Year    Tons
1900    270,588
1914    555,031

2. Where possible represent quantities by linear magnitudes 
as areas or volumes are more likely to be misinterpreted. 



3. For a curve the vertical scale, 
whenever practicable, should be so 
selected that the zero line will appear 
on the diagram. 



[Fig. 7. Curve diagram: Sales in dollars (0 to 1,000) by
Months (1 to 12).]






4. If the zero line of the vertical scale will not normally
appear on the curve diagram, the zero line should be shown
by the use of a horizontal break in the diagram.




[Fig. 9A. Curve diagram: Population (20,000,000 to 100,000,000).]

[Fig. 9B. Curve diagram: horizontal scale 5 to 35.]



5. The zero lines of 
the scales for a curve 
should be sharply distin- 
guished from the other 
coordinate lines. 



[Fig. 9C. Curve diagram with an emphasized zero line: vertical
scale from 2,000 above zero to 1,500 below zero, by Year.]






[Fig. 10A. Curve diagram: Relative Cost by Year.]

[Fig. 10B. Curve diagram by Year.]



6. For curves having a scale representing percentages, it
is usually desirable to emphasize in some distinctive way
the 100 per cent line or other line used as a basis of
comparison.



7. When the scale 
of a diagram refers to 
dates, and the period 
represented is not a 
complete unit, it is 
better not to empha- 
size the first and last 
ordinates, since such 
a diagram does not 
represent the begin- 
ning or end of time. 



[Fig. 10C. Curve diagram: Per Cent of People (0 to 100)
against Per Cent of Income.]

[Fig. 11. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]






8. When curves are drawn on 
logarithmic coordinates, the limit- 
ing lines of the diagram should 
each be at some power of ten on 
the logarithmic scales. 
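
The rule for the limiting lines can be sketched in a few lines of Python. This is only an illustration: the function name `log_axis_limits` is hypothetical, and the two sample values are the 1840 and 1910 population figures from the Committee's own table.

```python
import math

def log_axis_limits(values):
    """Return the (lower, upper) powers of ten that enclose all values."""
    lo, hi = min(values), max(values)
    if lo <= 0:
        raise ValueError("a logarithmic scale needs positive values")
    lower = 10 ** math.floor(math.log10(lo))  # power of ten at or below the minimum
    upper = 10 ** math.ceil(math.log10(hi))   # power of ten at or above the maximum
    return lower, upper

# Population figures for 1840 and 1910.
print(log_axis_limits([17_069_453, 91_972_266]))
```

For these figures the diagram would be bounded by the 10,000,000 and 100,000,000 lines.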



[Fig. 12. Curve diagram on a logarithmic vertical scale:
Population (1,000,000 to 100,000,000) by Year.]

[Fig. 13A. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]

9. It is advisable not to show any more coordinate lines than
necessary to guide the eye in reading the diagram.

10. The curve lines of a diagram should be sharply
distinguished from the ruling.







[Fig. 15A. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]

[Fig. 15B. Curve diagram: Analysis, % Ash; horizontal scale
10 to 30.]



11. In curves representing a series of observations, it is
advisable, whenever possible, to indicate clearly on the
diagram all the curves representing the separate observations.



[Fig. 15C. Curve diagram: Pressure in Lbs. per Sq. In.
(0 to 200) against Speed in R.P.M.]



12. The horizontal scale for curves should usually read from
left to right and the vertical scale from bottom to top.



[Fig. 16. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]






[Fig. 17A. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]

[Fig. 17B. Curve diagram: Gain or Loss (scale from 2,000 above
zero to 1,000 below zero) by Year.]

[Fig. 17C.]



13. Figures for the scales of a diagram should be placed at the
left and at the bottom or along the respective axes.






[Fig. 18A. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]

[Fig. 18B. Curve diagram: vertical scale 0 to 180, by Month
(1 to 12).]

14. It is often desirable to include in the diagram the
numerical data or formulae represented.

[Fig. 18C. Curve diagram plotted on X and Y axes.]

15. If numerical data are not included in the diagram it is
desirable to give the data in tabular form accompanying the
diagram.

[Fig. 19. Curve diagram: Population (20,000,000 to 100,000,000)
by Year, accompanied by the table below.]

Year    Population
1840    17,069,453
1850    23,191,876
1860    31,443,321
1870    38,558,371
1880    50,155,783
1890    62,622,250
1900    75,994,575
1910    91,972,266






16. All lettering and all figures on a diagram should be
placed so as to be easily read from the base as the bottom,
or from the right-hand edge of the diagram as the bottom.



[Fig. 20. Curve diagram: Population (20,000,000 to 100,000,000)
by Year.]



17. The title of a diagram should be made as clear and
complete as possible. Sub-titles or descriptions should be
added if necessary to insure clearness.



[Fig. 21. Curve diagram by Month (1 to 12), with the title
and sub-titles printed on the diagram: Aluminum Castings
Output of Plant No. 2, by Months, 1914. Output is given in
short tons. Sales of Scrap Aluminum are not included.]



Further Principles of Graphing. — The suggestions 
given below do not appear in the report of the Committee 
on Graphic Presentation, but through the influence of
Brinton's book,¹ in particular, they have become rather
generally accepted as good practice. The reader is referred to
his book for a further amplification and illustration of these 
principles. 



¹ Willard C. Brinton, "Graphic Methods for Presenting Facts"; The Engineering
Magazine Co., New York, 1917, 371 pp.




18. When several items are being compared the item of
chief interest may be made more striking than the others.

The most important item can be made more striking by
the use of (a) capitals or red letters for the title. Thus in
Fig. 6, for example, the "1914" and the "555,031" could
have been printed in red, provided the year 1914 had some
peculiar importance. If a principal were comparing his
school with other schools he would make the title of the bar
representing his own school red, or capitalize the title of his
school. If, on the other hand, several schools are being
compared with a standard, the standard would be made red
because the standard would be the most prominent item.

The important item could be made more striking by the
use of (b) a solid bar for the important item and an outlined
bar for the secondary items, or by the use of (c) a
heavier bar or curve for the important item, or by the use of
(d) a colored bar or curve for the important item. If
desirable and undesirable items are being compared and 
more than one color is used, it has become a practice to 
represent the undesirable items by red and the desirable 
items by green. 

19. Popular features or "eye catchers" may be used to
attract attention to the diagram but may not, as a rule, be
an integral part of the diagram.

If the diagram concerns the cost of producing a given 
unit of growth in pupils large $'s will help to attract atten- 
tion, but they should accompany the diagram and not be 
a part of it. That is, no attempt should be made to show 
the cost by the number of $'s. 

20. Do not place captions or numbers so as to alter the 
length of bars or to interfere with a visual comparison of 
their length. 

This means that all numbers should appear at the left of 
the bars, unless the bars are drawn vertically, in which case 
the numbers may appear at the top of the bars written hori- 
zontally. 

Were the numbers shown to the right of the bars in Fig.
6 instead of at their left and were the tons for the lower
bar a million or more the 1914 bar would be made to appear 
longer than it really is, due to the longer length of the num- 
bers representing tons. The caption for each bar could also 
be so placed as to produce a like illusion. 

21. When a scale (time scale especially) is not consecutive
indicate the gap by a wider-than-usual space interval.

Suppose there were a column of five bars like those of 
Fig. 6, the top one showing the score made on a test by 
Grade III and the bottom one showing the score made by 
Grade VIII. Suppose further that there is no score or bar 
for Grade VII. The omission of Grade VII should be indi- 
cated by a relatively wide gap between the sixth and eighth 
grade bars. Otherwise the reader is likely to be misled into 
thinking there is a point in the elementary school where 
there is an exceptionally rapid growth. 
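
One way to obtain the wider gap automatically is to place each bar at a distance proportional to its grade number rather than at equal intervals. The sketch below assumes that approach; the function name `bar_positions` and the `step` parameter are hypothetical.

```python
def bar_positions(grades, step=1.0):
    """Place each bar at a distance proportional to its grade number,
    so that an omitted grade leaves a visibly wider space."""
    first = grades[0]
    return {grade: (grade - first) * step for grade in grades}

# Five bars for Grades III to VIII, with no bar for Grade VII.
positions = bar_positions([3, 4, 5, 6, 8])
for grade, offset in positions.items():
    print(f"Grade {grade}: offset {offset}")
```

The space between the Grade VI and Grade VIII bars comes out twice as wide as the space between any other adjoining pair.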

22. In graphing two or more bars or curves for compari- 
son make their zero lines coincide. 

Anyone who has ever drawn straws to determine who shall 
get the only apple, or pay for the drinks, knows that he 
must be suspicious of the apparent length of the straws. We 
are never sure of our comparison until we discover the zero 
point of each straw. It is necessary to be equally suspicious 
of graphs whose zero points are not clearly revealed. 

23. Do not use a percentage curve when it is wished to 
show the actual amounts of increase or decrease and do not 
use an amount curve when it is wished to show the per cents 
of increase or decrease. 

Either a curve must be drawn on a logarithmic scale in 
order to show both amounts and per cents of change or else 
two graphs are required, one to show amount and one to 
show percentage. 
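
The property of the logarithmic scale relied upon here can be checked with a little arithmetic. In the sketch below, which uses made-up enrollment figures and a hypothetical function name, two very different amounts of increase that represent the same per cent of increase produce the same vertical rise.

```python
import math

def log_scale_rise(start, end):
    """Vertical distance between two values on a base-10 logarithmic scale."""
    return math.log10(end / start)

small_school = log_scale_rise(100, 200)        # gain of 100 pupils, 100 per cent
large_school = log_scale_rise(10_000, 20_000)  # gain of 10,000 pupils, 100 per cent

# Equal per cents of increase give equal rises, although the amounts differ.
print(small_school == large_school)
```

The amounts themselves are still read from the figures along the vertical scale, which is why one logarithmic curve can serve both purposes.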

As to comparable scaling, it is well to remember that of 
two curves plotted to the same scale and whose variability 
is identical, the upper curve will appear to have larger 
fluctuations. Statisticians are familiar with the notion that 
the variability of two sets of data cannot well be compared
until the variability of each has been divided by the average
of the data from which each variability was computed. 
This means that the larger the data is numerically the larger 
will be the amount of fluctuation, even when the percentage 
of variation remains constant. When it is wished to com- 
pare the fluctuation of two curves on the same graph, one 
of which represents numerically small amounts and the 
other numerically large amounts, convert the amount curves 
into percentage curves and interpret in the light of the 
original absolute amount of each. 
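
A small numerical sketch may make the point concrete. It assumes the standard deviation as the measure of variability (the text does not name one), and the two series and the function names are invented for illustration.

```python
from statistics import mean, pstdev

def relative_variability(values):
    """Variability divided by the average of the data it came from."""
    return pstdev(values) / mean(values)

def as_percent_of_average(values):
    """Convert an amount curve into a percentage curve."""
    average = mean(values)
    return [100 * value / average for value in values]

small = [8, 10, 12]      # numerically small amounts
large = [80, 100, 120]   # numerically large amounts

print(pstdev(small), pstdev(large))   # the raw fluctuation is ten times larger
print(relative_variability(small), relative_variability(large))
print(as_percent_of_average(small), as_percent_of_average(large))
```

Plotted as amounts, the second series appears far more variable; plotted as per cents of its own average, each series follows the same curve.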

24. Use a diagram which is appropriate to the data to be 
presented. 

What diagrams to use in a given situation is discussed 
below. 

Types of Diagrams. — There are a bewildering variety 
of diagrams, some good, some bad. And there is an un- 
limited number of graphs which may be classed as cartoons. 
Such, for example, is a drawing showing which of a pupil's
neural pathways are in action when he is adding, or a 
drawing which pictures the number of germs in the water 
where pupils swim or any other of the numerous pictographs. 
The value of such cartoons usually disappears with use and 
hence they are not appropriate material to consider here. A 
ride on a street car, or a brief study of bill boards will give 
enough suggestions of cartoons to use. To standardize them 
would be to destroy their value. 

Most of the standard diagrams are variations upon a 
few simple types. The few types listed below will be found 
adequate for most persons and most purposes. If any reader 
plans to do a great deal of graphing he should consult some 
special treatise on the subject, such as Brinton's. 

Type I. The sector diagram. Thus far in this book no 
illustration of the sector diagram has been printed. One is 
given in Fig. 22. 

The construction of a sector diagram is exceedingly 
simple. There are 360 degrees in the circle. Sixty-six per 
cent of 360 degrees is 237.6 degrees. The 237.6 degrees
may be roughly estimated with the eye or more accurately
measured with a protractor. The other sectors are deter- 
mined in a similar fashion. The diagram would be much 
more striking if each sector were colored to fit the race which 
the sector represents. 
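
The arithmetic for every sector can be sketched as follows. Only the 66 per cent component comes from the text; the other two per cents, the group labels, and the function name are hypothetical.

```python
def sector_angles(percents):
    """Convert per cents of the whole into degrees of a 360-degree circle."""
    if abs(sum(percents.values()) - 100) > 1e-9:
        raise ValueError("the components must total 100 per cent")
    return {name: round(p * 360 / 100, 1) for name, p in percents.items()}

# 66 per cent is the component discussed above; the rest are made up.
angles = sector_angles({"Group A": 66, "Group B": 20, "Group C": 14})
for name, degrees in angles.items():
    print(f"{name}: {degrees} degrees")
```

The first sector comes to 237.6 degrees, as computed in the text, and the three sectors together fill the circle.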

Type II. The bar diagram. See, for illustration, Fig. 6.

Type III. The sectioned-bar diagram, (a) without sub-divisions,
and (b) with sub-divisions of the component parts.
The top bar of Fig. 23 illustrates the sectioned-bar diagram
without sub-divisions, while the entire figure illustrates the
diagram with sub-divisions.

[Fig. 22. Sector diagram: Distribution by Race of the Pupils
in Grades III Through VIII of a Public School in an
Eastern City.]

This diagram uses such a design in each section as to 
make it appear distinct from the adjoining sections. This 
plus the sectioning makes it clear at a glance that in this 
particular school the per cent of pupils attaining standard 
gradually increases with progress through the grades. A 
larger percentage of girls attain standard than boys. With
progress through the grades the percentage of boys who
attain standard gradually increases relative to the percentage 
of girls. In Grade VII the boys' percentage has reached 
the percentage of the girls. 

The unique combination of the bar and sectioned-bar 
diagrams shown in Fig. 24 is not only unusual but also 
unusually effective. 



[Fig. 23. Sectioned-bar diagram with three bars: Total, 100%;
Girls, 59%; Boys, 41%. The Per Cent Which the Number of
Pupils in Each Grade Was of the Total Number of Pupils in
All Grades Who Attained Woody's Norms According to a Random
Sampling of 300 Boys and 300 Girls in a New York School.]

Type IV. The frequency surface. Figs. 26 and 28 may 
be examined as illustrations of frequency surfaces. 

Type V. The curve diagram. Numerous illustrations
appear in the report of the Joint Committee on Standards
for Graphic Presentation. Fig. 5 may be inspected as a sample.

Practically every diagram listed above, except the sector 
diagram, is a bar diagram or some variation on this basic 
type. The sectioned-bar diagram is merely a bar diagram 
divided into component parts. The frequency surface is 
merely a series of bar diagrams placed close together and 
in a vertical position. A curve diagram is merely a series 
of non-adjoining narrow vertical bars which are connected 
at the tops with a continuous line or curve. 

A special form of the curve diagram frequently used by
mental measurers is the psychograph or mental profile. If
the zero line in Fig. 9C represented the standard scores on
several mental tests and the dates shown at the bottom were
each the name of a mental test, then the second curve would
show a sort of mental profile of an individual or group.

[Fig. 24. Rank of Cleveland among Eighteen Cities in Expenditure
for Operation and Maintenance of Schools. (After L. P. Ayres,
The Cleveland School Survey, 1916.)]

Selection of Diagram to Show Component Parts. — 
Frequently in educational measurement it is necessary to 
show what part each of various components is of the whole. 
In order to assist an audience to properly interpret certain 
test results it may be necessary to show what per cent of the 
total number of pupils in a school system belongs to the 
White race, Black race, etc. It may be necessary to show 
what per cent of pupils in Grade IV of a certain school are 
eight, nine, ten, or eleven years of age. It may be desirable 
to show how many or what per cent of the pupils, or schools, 
or cities make various scores on the test. All these are 
situations involving component parts of a whole, and require 
a diagram appropriate to component parts. 

Perhaps the simplest of all diagrams for showing com- 
ponent parts is a sector diagram such as is shown in Fig. 22. 
The sector diagram would serve for any situation listed in 
the preceding paragraph. 

The sectioned-bar diagram shown in Fig. 23 is an even 
better graph for presenting component parts. It is in almost 
every respect superior to the sector diagram. Visual com- 
parisons of the components are easier. The direction of all 
lettering is uniform. The numerical data can be so placed 
that numbers and decimal points are directly under each 
other, so that the addition of any or all components is greatly 
simplified. The sector diagram is not nearly so flexible. It 
will satisfactorily show only one series of components. The 
sectioned-bar diagram will show one or more sub-divisions 
of components. Hence, except in the situation noted below, 
the sectioned-bar diagram should usually be preferred to 
other diagrams for showing component parts. 
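
As a rough illustration of how a sectioned bar is laid off, the sketch below draws one bar in plain text, giving each component a length proportional to its per cent. The 59 and 41 per cents are those shown for girls and boys in Fig. 23; the function name, the symbols, and the bar width are hypothetical.

```python
def sectioned_bar(components, width=50):
    """Return one text bar whose sections are proportional to the per cents."""
    total = sum(percent for _, percent in components)
    bar, running, drawn = "", 0, 0
    for symbol, percent in components:
        running += percent
        edge = round(running * width / total)  # right-hand edge of this section
        bar += symbol * (edge - drawn)
        drawn = edge
    return bar

# Girls 59 per cent, boys 41 per cent.
print(sectioned_bar([("G", 59), ("B", 41)]))
```

Because each section's edge is computed from the running total, the sections always add up to the full length of the bar.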

When it is wished to show the number or per cent of 
pupils making various scores or who are of various ages, or 
in any situation where the unit is a consecutive numerical 
fact such as scores, ages, dates, and the like, the frequency
surface is the most convenient graph, although any of the
others could be used. 

There are several useful variations of the frequency sur- 
face. Fig. 25, for example, reveals not only the number of 
schools making various scores on a test but also the identity 
of each school making a given score. Again, a frequency 
surface will show sub-divisions of component parts, in which 
case the graph really becomes a series of vertically arranged 
sectioned-bars. 
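
The first variation, in which each unit of the surface carries the identity of a school, amounts to grouping school numbers by score. The sketch below uses invented school numbers and scores, and the function name is hypothetical.

```python
from collections import defaultdict

def stack_by_score(school_scores):
    """Group school identification numbers under the score each school made."""
    columns = defaultdict(list)
    for school, score in sorted(school_scores.items()):
        columns[score].append(school)
    return dict(sorted(columns.items()))

# Hypothetical identification numbers and average scores.
columns = stack_by_score({1: 70, 2: 72, 3: 70, 4: 71, 5: 70})
for score, schools in columns.items():
    print(score, schools)
```

The height of each column is the number of schools making that score, and the entries in the column tell which schools they were.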



[Fig. 25. Frequency surface in which each unit bears the
identification number of a school, with average scores in
spelling along the base line. The Number of Cleveland
Elementary Schools and the Identification Number of Each
School Making Various Average Scores in Spelling. (After
C. H. Judd, Measuring the Work of the Public Schools,
Russell Sage Foundation, N. Y., 1916.)]



Selection of Diagram to Show Comparisons. — For 
simple comparisons, the best diagram is the bar diagram, an 
example of which is shown in Fig. 6. The bar diagram can 
be used to advantage in such situations as the following: 
where it is necessary to compare (a) the number of pupils
in one grade with the number of pupils in another grade, (b)
the norm on a test for, say, Grade III with the median score
made by a class in Grade III, (c) the score of a grade in
one school with the score made by each of several similar
grades in other schools, (d) the median score made by one
grade with the median scores made by each of several
grades, (e) the score made by one pupil in a class with the
score made by each of the other pupils. No matter how
numerous the items, wherever only simple comparisons are 
involved the bar diagram is thoroughly satisfactory. 

Special variations on the diagram can be made to suit 
special situations. If, for example, the scores of pupils in 
a class be represented by a series of horizontal bars, one 
vertical bar can show the median for the class, another can 
show the norm, thus making possible a comparison of each 
pupil with every other pupil, with the median for the class, 
and with the norm. 
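
A text-only sketch of such a diagram is given below. The pupils, their scores, and the norm are invented, the function name is hypothetical, and the median and norm are shown as two extra rows rather than as vertical bars, which plain text cannot easily draw. In keeping with principle 20, the numbers are placed at the left of the bars.

```python
from statistics import median

def text_bars(scores, norm):
    """Return one horizontal bar per pupil, followed by median and norm rows."""
    mid = median(scores.values())
    lines = [f"{name:<8}{score:>3} {'#' * score}" for name, score in scores.items()]
    lines.append(f"{'Median':<8}{mid:>3} {'-' * int(mid)}")
    lines.append(f"{'Norm':<8}{norm:>3} {'-' * norm}")
    return lines

# Hypothetical class of five pupils and a hypothetical norm of 13.
for line in text_bars({"A": 12, "B": 15, "C": 9, "D": 14, "E": 11}, norm=13):
    print(line)
```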

A bar diagram is not satisfactory for comparing two 
different series of components. The sector diagram, how- 
ever, will permit such a comparison. If we were to place 
beside Fig. 22 another circle of equal size showing similar 
facts for another school system it would be possible for the 
eye to roughly compare the sectors. Other graphs, however, 
permit an easier and more accurate comparison. 

The sectioned-bar diagram will show comparisons be- 
tween two series of data better than the sector diagram. 
If we were to place one or more graphs, showing similar 
data, for another school directly under the top bar of Fig. 
23 the eye could, with some difficulty, compare the length 
of one section with the corresponding section. Comparison 
is made difficult by the fact that the beginning points of all 
corresponding sections are not directly over each other. 

The frequency surface is even more useful than the sector 
or sectioned-bar diagrams for comparing series of components.
Fig. 2 illustrates such a use. Here the frequency
surfaces are placed one above the other. When not more 
than two series of components are being compared the two 
frequency surfaces may be placed on the identical base line. 
When there are more than two surfaces on the identical base 
line the overlapping becomes too confusing to be useful. 

The curve diagram is the most useful of all graphs. Its 
prevalence in the report of the Joint Committee on Standards
for Graphic Presentation is a sort of index of its utility.
Before finally choosing the type of diagram for presenting
his data the reader will do well to go through the charts of
the Joint Committee to see if some curve which he finds 
there may not satisfy the condition of his data. The curve 
is familiar to most persons; it is easily and quickly read; it 
is so flexible that almost any data can be presented by means 
of it. 

The curve is particularly effective for comparing two 
series of similar data. Suppose we have a curve showing 
the progress of the medians from grade to grade of a certain 
school on a certain test. One or more other curves repre- 
senting the grade progress of other schools on the same test 
may be drawn on the same diagram, thus permitting easy 
comparison. 

Curve diagrams may also be used to compare series of 
components. The curve diagram could take the place of 
the overlapping frequency surfaces in Fig. 2. When a fre- 
quency surface is made with a series of rectangles as in Fig. 
2 it is called a histogram. When a frequency surface is made 
with a continuous curve it is called a frequency polygon. 
All that is needed to convert the histogram into a frequency 
polygon is to draw a continuous line which passes through 
the middle point of the top of each rectangle and then erase 
the lines which block out the rectangles. In practice, the 
frequency polygons are drawn directly from the data. 
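
The conversion described above reduces to finding the middle point of the top of each rectangle. The sketch below assumes intervals of equal width identified by their lower limits; the data and the function name are hypothetical.

```python
def polygon_points(lower_limits, frequencies, width=1):
    """Return the (x, y) points through which the frequency polygon passes:
    the middle of the top of each histogram rectangle."""
    return [(low + width / 2, freq) for low, freq in zip(lower_limits, frequencies)]

# Three rectangles standing on the intervals 6-7, 7-8, and 8-9.
print(polygon_points([6, 7, 8], [1, 2, 3]))
```

Joining these points in order with straight lines gives the frequency polygon without the rectangles ever being drawn.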

The curve is equally preeminent for showing the relation- 
ship between two series of data. A curve of grade progress 
shows the relationship which obtains between grade and 
score on a test. A curve of age progress shows the relationship
between the age of a pupil and score on a test. Fig.
15C is an illustration of how the curve type of chart may be
used to give a graphic picture of correlation. 

Preparation of Diagrams. — The following materials 
are either essential or useful in charting: appropriately 
ruled paper or plain paper to be ruled by the person mak- 
ing the chart, drawing board, T-square, decimal scale ruler, 
French curves, reducing glass, colored crayons, waterproof 
India ink of various colors, gummed letters and figures.
Still other appliances would be useful but few persons out-
side of professional draftsmen have half the material already 
mentioned. 

The material upon which the diagram is drawn will vary 
with circumstances. When test records are kept on file 
from year to year it will be found advisable to make dia- 
grams for these files on ruled cards of uniform filing size. 
For lecturing purposes the chart may, by means of a brush, 
be drawn in white paint on black cambric cloth. When 
the paint has dried this cloth can be folded and packed 
into a small space in a handbag. 

The charts should be drawn in harmony with the sugges- 
tions already presented. Besides this, the diagram should 
be neat with all lettering as plain as possible. Gummed let- 
ters and numbers may be used to produce a clearer and 
neater picture, if the person making the diagram is not 
skilled in making letters and numbers. If the diagram is 
intended for publication it is advisable to make the drawing 
larger than it will be when published, in order that in the 
process of reduction to printing size minor irregularities will 
disappear. If the graph is made just twice the printing size 
great care must be taken to see that every proportion of 
the original is exactly twice the size that is finally desired. 
Coordinate and other lines must be twice as wide, and twice 
as far apart. Letters and numbers must be twice as high 
and wide and so on. These proportions may be determined 
by general judgment, by use of the reducing glass, or by 
actual measurement. Finally, the diagram should be drawn 
in India ink in order that it may give a clear photograph. 
Black, red, green, and blue India inks all photograph black. 
Black prints blacker than any other color; red is a close 
second, and the others in the order named. 

Reproducing the Diagram.— If the diagram is in- 
tended for local use it may be reproduced on a hectograph 
or mimeograph at very little expense and with very little 
trouble. If this method of reproduction is used the diagram 
must be prepared with a special kind of ink in the former
case, or on a special stencil in the latter case. Adequate
instructions for this process come with these reproducing 
instruments. A school can ill afford to be without either a 
hectograph or mimeograph. 

The blue print is another method of speedy and inex- 
pensive reproduction and so is the photostat machine. The 
photostat machine will make direct photographic copies of 
diagrams. Blue-printing and photostat companies will be 
found in most large cities. 

The stereopticon, reflectoscope, and motion picture may 
be considered reproducing machines. There are companies 
who will convert any diagram into a lantern slide whose use 
in connection with a stereopticon will throw the diagram on 
a screen. Many schools are finding the stereopticon an in- 
dispensable adjunct. There are portable stereopticons 
which may advantageously be taken on lecture tours.
Reflectoscopes are made which will reflect a diagram directly
from the paper drawing. This saves time and expense in- 
volved in having lantern slides prepared but it is not so 
satisfactory in other respects as the stereopticon. All are 
familiar with the motion picture machine. 

If a diagram is published one of three methods may be 
employed, (a) a zinc or line cut, (b) half-tone or copper
plate, or (c) Ben Day. The zinc cut is the cheapest, the
half-tone next, and the Ben Day process is the most ex- 
pensive. As stated before, diagrams are usually more effec- 
tive when printed in color, but color printing is very 
expensive indeed. Before the diagram is sent to the pub- 
lisher instructions as to the process and the final dimensions 
desired should be noted on the margin, preferably in blue 
pencil since such markings do not photograph. If the 
process is Ben Day the shading desired for each portion of 
the graph should be selected from a catalog and indicated. 



CHAPTER XIV 
STATISTICAL METHODS— MASS MEASURES 

Three Types of Mass Measures. — The first step in a
statistical study of scores is to convert the data into one or 
more mass measures. This step is the continental divide 
between correct and incorrect statistical procedure, and 
should not be omitted even where it is not necessary to 
subsequent computation. What statistical measure to com- 
pute, whether to compute any measure at all, how to inter- 
pret the statistical measures when computed, all three 
questions depend for their answer in part upon one of the 
following three mass measures, especially the first or 
second. 

I. Frequency Surface. 
II. Frequency Distribution. 

III. Order Distribution. 

I. Frequency Surfaces 

Normal Frequency Surface. — Suppose we have the
following table of scores:

TABLE 30

Specially Chosen Scores Made by Sixty-four Fourth-Grade Pupils
on a Spelling Test

15  17  14  19  14
11  16  11  15  11
11   9  15  10  19
13  17  18   7  12
 9  10  16  16  13
12  14  12  11   7
 8  18  17  17  14
10  12  12  14   9
13  15  16  16  14
13  18  13  13  13
10  10  11  20  12
15  13   8   8   6
14  15  12   9

In their present form these scores are practically unin- 
terpretable, but observe the difference when they are turned 
into the frequency surface of Fig. 26. 

How to Construct the Frequency Surface. — 

1. The base line of the graph is drawn. 

2. The smallest score in the table is 6 and the largest 
is 20. Beginning at the left, since the left always means 
low scores, 6, 7, 8 and so on to 20 are written under suc- 
cessive vertical lines. 

3. The first score in the table is 15. Now for a pupil to
get a score of 15 in this test means that his true score is
somewhere between 15.0 and 15.999, etc. Hence a dot is
placed in the square just above and to the right of 15, i. e.,
in the square just above the distance 15.0 — 16.0. The
next score is 11 and a dot is placed just above and to the
right of 11. The next score is also 11, but instead of placing
two dots in one square, the dot is placed in the square
immediately above the square holding the preceding dot.
In this way one square represents one person. This process
is continued until every score is checked into its own
proper square.

SCORE       6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
FREQUENCY   1  2  3  4  5  6  7  8  7  6  5  4  3  2  1

Fig. 26. An Approximately Normal Frequency Surface.
(Data from Table 30.)

4. The checked squares are traced with a boundary line 
in block fashion, and we have the frequency surface of 
Fig. 26. 
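The four construction steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original method: the function name `frequency_surface` is hypothetical, the scores are those of Table 30 as printed here, and the surface is printed sideways (one row per score) rather than as vertical columns of squares.

```python
from collections import Counter

# Scores of Table 30, read row by row from the table as printed.
scores = [
    15, 17, 14, 19, 14,
    11, 16, 11, 15, 11,
    11,  9, 15, 10, 19,
    13, 17, 18,  7, 12,
     9, 10, 16, 16, 13,
    12, 14, 12, 11,  7,
     8, 18, 17, 17, 14,
    10, 12, 12, 14,  9,
    13, 15, 16, 16, 14,
    13, 18, 13, 13, 13,
    10, 10, 11, 20, 12,
    15, 13,  8,  8,  6,
    14, 15, 12,  9,
]

def frequency_surface(scores):
    """Return {score: frequency} for every unit step from the
    smallest score to the largest, including steps with no scores."""
    counts = Counter(scores)
    return {s: counts.get(s, 0) for s in range(min(scores), max(scores) + 1)}

surface = frequency_surface(scores)
for score, freq in surface.items():
    # One mark per pupil, mirroring "one square represents one person".
    print(f"{score:>2} | {'#' * freq} ({freq})")
```

Because the order in which the scores are tallied does not matter, reading the table by rows or by columns gives the same surface.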

How to Read the Frequency Surface. — Fig. 26 reduces
confusion to order and permits one quickly to grasp
the total condition of the grade. Besides this, the graph
tells the following story almost at a glance. (1) The
smallest score is 6 and the largest 20. (2) There is one
score of 6, two of 7, three of 8 and so on. (3) The test
has been neither too easy nor too difficult, for the scores
do not pile up at either the low or high end of the
distribution. (4) The variation in scores from 6 to 20 is
continuous. (5) The distribution of scores is symmetrical.
(6) There is but one central tendency or mode, hence the
surface is unimodal, i. e., most of the scores tend to cluster or
swarm about one mode or point. (7) The crude mode is
at 13 which has a frequency of eight. (8) The frequency
surface is a rough approximation to the normal frequency
surface, and hence the subsequent statistical methods
appropriate to the normal distribution will apply fairly well
to the data of Table 30. Fig. 26 only approximates the
normal surface, for the latter is a smooth curve shaped more
like a bell and less like a pyramid, thus giving an even
greater clustering toward the central tendency, as may be
seen from Fig. 27.

Fig. 27. A Normal Frequency Surface.

Why Frequency Surfaces Are Normal. — The normal
frequency surface appears to be Nature's favorite mold.
A random sampling of most facts gives the normal sur- 
face. Morality, intelligence, the weights and heights of men, 
the blueness of eyes and doubtless the intensity of halos fit 
the normal curve. It is seldom that mental and educational 
scores make a perfectly normal surface, but they usually 
give a rough approximation to it. This is, according to 
Thorndike, not because Nature abhors irregular distribu- 
tions but because there are usually present in nature the 
necessary determiners. 

Experimental research has isolated these determiners,
and has found that the measurements for a given fact fit
the normal frequency surface when the fact measured is
the product of the joint action of (1) a large number of
causes, (2) causes which are approximately equal, (3)
causes which are mutually uncorrelated or act independently
of each other, i. e., the presence of one cause does not bring
with it the other causes. It is because these conditions are
usually present in education that most educational
measurements approximate the normal distribution.

SCORE       6  7  8  9 10 11 12 13 14 15 16
FREQUENCY   1  2  3  4  5  6  7  8  9 10  9

Fig. 28. Minus Skewed Frequency Surface.

SCORE       6  7  8  9 10 11 12 13 14 15 16
FREQUENCY   9 10  9  8  7  6  5  4  3  2  1

Fig. 29. Plus Skewed Frequency Surface.
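The three conditions for a normal surface can be illustrated with a toy model that is not taken from the text: assume 14 causes, each either present or absent with equal chance, each acting independently and each adding one point to the result. The hypothetical function below counts how many of the equally likely present/absent patterns give each total; the counts are the binomial coefficients.

```python
from math import comb

def cause_count_frequencies(n_causes):
    """For n_causes independent, equally weighted causes, each either
    present or absent, return how many of the 2**n_causes equally likely
    patterns contain exactly k causes, for k = 0 .. n_causes."""
    return [comb(n_causes, k) for k in range(n_causes + 1)]

# 14 causes give 15 possible totals, the same number of steps as Fig. 26.
freqs = cause_count_frequencies(14)
for k, f in enumerate(freqs):
    print(f"{k:>2} causes present: {f}")
```

Under these assumptions the frequencies are symmetrical with a single mode in the middle, and they cluster toward the center more strongly than the pyramid of Fig. 26 does. Dropping any one of the three conditions (few causes, unequal causes, or causes that bring one another along) breaks this pattern.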

Skewed Frequency Surface. — One form of departure 
from the normal surface is called the skewed frequency 
surface. Skewed surfaces may be either minus or plus. 
Fig. 28 represents a minus skewness and Fig. 29 a plus 
skewness. 

Why Frequency Surfaces Are Skewed. — It is actually
possible to secure either one of these surfaces or
approximations to them from the application of a spelling test. Let
us enquire why Fig. 28 shows a minus skewness. A com- 
plete knowledge of the situation would be necessary before 
an answer could be given. But it is possible to list some of 
the more probable causes. 

(1) Possibly the test consisted of but sixteen words
and all of these were too easy for the abler pupils, thus 
causing a piling up of the scores at 15 and 16. Ordinarily 
we should expect the mode to be at 16 instead of 15, but 
very able pupils often make a score a few less than perfect 
from purely accidental misspelling. 

(2) Possibly the best half of the grade in spelling had 
been promoted just before the test was administered. In 
this event we can think of Fig. 28 as being the lower half 
of a once normal surface. 

(3) Possibly the school is located in a neighborhood 
where the intellectual composition of the parents corresponds 
to this surface, thus causing by heredity a similar condi- 
tion among the pupils. 

(4) Possibly the native abilities of the pupils are dis- 
tributed normally and the form of this surface is produced 
by the teacher's method of ceasing to drill pupils who have 
attained a standard of 16. A number of other teaching 
methods would produce similar results. 

(5) Possibly the frequency surface is not due to one 
cause but to the cooperation of two or more of the previously 
mentioned causes. 

It should be observed that all these explanations
contemplate a departure from (1) a large number of causes or (2)
equality of influence of causes or (3) uncorrelated causes.
Most of the explanations have contemplated a very few
causes each of which has enormous potency in determining
the form of the frequency surface. The reader will find it
profitable to stop and enumerate the more probable causes
of Fig. 29.

Multi-Modal Frequency Surface. — Another departure
from the normal surface is called the multi-modal frequency
surface. As the name implies, it has two or more central
tendencies or modes. Such a surface is Fig. 30, which has two.

Why Frequency Surfaces Are Multi-Modal. — Multi-modal
surfaces may be produced by a number of causes,
but most of them are the result of the operation of some
one large force of great potency. The most common cause
of all multi-modal surfaces is the mixing of two
non-homogeneous groups. If, for example, scores from two
different grades are graphed together the result is usually
a multi-modal frequency surface.

SCORE       6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
FREQUENCY   1  2  3  4  5  6  5  4  2  2  4  5  6  5  4  3  2  1

Fig. 30. Multi-modal Frequency Surface.

II. Frequency Distributions 

Comparison of Frequency Surface and Frequency 
Distribution. — The second mass measure is called the 
frequency distribution, and, like the frequency surface, 
appears in three forms, normal, skewed, and multi-modal. 
In fact, the only difference between a frequency surface 
and a frequency distribution is this: the former shows in 
graphic form what the latter shows in tabular form. 

The very best method of constructing a frequency
distribution is to first construct its corresponding surface. The
two rows of numbers beneath the base line of Fig. 26 are an
approximately normal frequency distribution. The
frequency of score 6, for example, is found by counting the
number of squares checked above it, in this case 1, and
similarly for the frequency of other scores. Figs. 28, 29,
and 30 show, respectively, at their bases, two skewed and
one multi-modal frequency distribution.

Step Interval. — In the illustrations thus far the step
interval has been 1. Often the scores are so scattered that
to use the original unit of scoring for a step interval would
give a frequency surface or distribution with a very much
attenuated appearance. This makes interpretation and
subsequent manipulation difficult. For practical work Rugg
recommends that the step interval be of such size as to
make the number of columns in the surface, or the number
of steps in the distribution, not more than 20 nor less than 10.

SCORE       3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
FREQUENCY   1  0  0  1  1  0  1  3  0  3  2  2  1  3  2  1  1  1  0  2  0  1  1

Fig. 31. A Frequency Surface with Step Intervals of 1.

Fig. 31 shows scores grouped according to the size of
the original scoring unit of 1, while Fig. 32 shows the scores
regrouped into a step interval of 2.

SCORE       3  5  7  9 11 13 15 17 19 21 23 25
FREQUENCY   1  1  1  4  3  4  4  3  2  2  1  1

Fig. 32. A Frequency Surface with Step Intervals of 2.
(Data from Fig. 31.)

All the scores of Fig. 31 from 3.0 up to but not including
5.0 are regrouped in Fig. 32 over the single interval 3.0 to
5.0. Those from 5 to 7 are grouped over the second interval
of Fig. 32 and so on.
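The regrouping just described can be sketched as follows. The function name `regroup` is hypothetical, and the unit-step frequencies are those read from the base of Fig. 31 (scores 3 through 25); each new step runs from its lower limit up to but not including its upper limit.

```python
def regroup(frequencies, lowest_score, step):
    """Combine unit-step frequencies into steps of the given size.

    Returns a list of (lower_limit, upper_limit, frequency) tuples,
    where each step includes its lower limit and excludes its upper."""
    grouped = []
    for start in range(0, len(frequencies), step):
        lower = lowest_score + start
        total = sum(frequencies[start:start + step])
        grouped.append((lower, lower + step, total))
    return grouped

# Frequencies for scores 3, 4, ..., 25 as read from Fig. 31.
fig31 = [1, 0, 0, 1, 1, 0, 1, 3, 0, 3, 2, 2, 1, 3, 2, 1, 1, 1, 0, 2, 0, 1, 1]

fig32 = regroup(fig31, lowest_score=3, step=2)
for lower, upper, freq in fig32:
    print(f"{lower:>2} - {upper:<2}  {freq}")
```

Calling the same function with `step=4` condenses the data further, into six steps beginning 3 to 7.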

In actual practice Fig. 32 is derived immediately from
the original data. We do not first plot an attenuated
frequency surface and then convert it into a less attenuated
one. It is of course more difficult to fix upon a satisfactory
step interval from the original data. The following
procedure will make the problem easy: (1) Subtract the
smallest score in the original data from the largest. (2)
Divide the remainder by such a divisor as will give a
quotient between 10 and 20. (3) Make the divisor the size of
the step interval. The size of this step interval should be
kept constant throughout the frequency surface or
distribution.
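The three-step procedure can be sketched as a small search over trial divisors. The function name and the list of candidate divisors are assumptions made for this illustration; the text itself leaves the choice of divisor to the worker.

```python
def choose_step_interval(scores, candidates=(1, 2, 3, 4, 5, 10, 20, 25, 50, 100)):
    """Return the smallest candidate divisor for which
    (largest score - smallest score) / divisor lies between 10 and 20,
    or None if no candidate qualifies."""
    spread = max(scores) - min(scores)          # step (1): subtract
    for divisor in candidates:
        if 10 <= spread / divisor <= 20:        # step (2): quotient 10 to 20
            return divisor                      # step (3): divisor is the step
    return None

# Scores ranging from 3 to 25, as in Fig. 31: 22 / 2 = 11 steps.
print(choose_step_interval([3, 25]))
# Scores ranging from 6 to 20, as in Table 30: 14 / 1 = 14 steps.
print(choose_step_interval([6, 20]))
```

Only the smallest and largest scores affect the result, so passing just those two values is enough for a quick check.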

All the frequency distributions that have been shown thus 
far have been located beneath frequency surfaces and placed 
horizontally. In actual computation frequency distribu- 
tions may be left in this position, but it is usually more con- 
venient to place them in a vertical position. The frequency 
distribution shown in Fig. 32 is divorced from its frequency 
surface and placed vertically in the first part of Table 31. 
It is condensed into larger step intervals in the second part 
of the table. 

TABLE 31

Shows First the Frequency Distribution of Fig. 32, and Second,
this Same Frequency Distribution Condensed into Step
Intervals of 4

Score      Frequency        Score      Frequency
 3 —  5        1             3 —  7        2
 5 —  7        1             7 — 11        5
 7 —  9        1            11 — 15        7
 9 — 11        4            15 — 19        7
11 — 13        3            19 — 23        4
13 — 15        4            23 — 27        2
15 — 17        4
17 — 19        3
19 — 21        2
21 — 23        2
23 — 25        1
25 — 27        1


How to Fix and Indicate the Step Limits. — No
statistical computation should be begun until the step limits
have been determined. It is not enough to know that the
size of the step interval is .5, 1, 2, 3, or more. We must
know from what point to what point the step interval
extends. The meaning of the score tells us the beginning point
of each step interval and the size of the step interval tells
the ending point of each step interval. The meaning of
each score in turn depends upon how the test was scored,
and the step interval, as we have already seen, depends upon
our own choice. Here are some test scores: 6, 7, 8, 9, etc.
What is the meaning of each score? We cannot tell without
further information as to how the test was scored. Let us
suppose that these are scores from some performance test
in arithmetic. Most performance tests are so scored that
if a pupil works 6 examples correctly he receives a score of
6; if he works 6.25 or 6.5 or 6.75 or 6.9 examples correctly
he is still scored 6. Hence in this case 6 means 6 — 6.999,
etc., or more conveniently 6 — 7. In other words the
beginning point of our first step interval is 6.0.

What is the step interval? We must decide this before
the upper limit of the first step interval can be fixed. The
step interval cannot in this case be less than 1; it may be
made as large as we think desirable. Suppose we make the
size of the step interval 1. Then the steps of the frequency
distribution would be 6 — 7, 7 — 8, 8 — 9, etc. If we make
the size of the step interval 2, the steps become 6 — 8,
8 — 10, 10 — 12, etc.

But if our scores of 6, 7, 8, etc., came from a product
scale instead of a performance scale, in all probability 6
does not mean 6 — 7, but 5.5 — 6.5, because most product
scales, the Thorndike Handwriting Scale, for example, are
so scored that a score means half a step below to half a
step above a given scale value. If a pupil's handwriting is
nearer in quality to value 6 than to any adjoining value he
is scored 6, i. e., 5.49 is scored 5, while 5.5, 5.7, 5.9, 6.0, 6.3,
6.4999 or any intermediate values are scored 6. Thus the
meaning of the score depends upon how the test was scored.
Now if we make the size of the step interval 1, the steps
will appear as follows: 5.5 — 6.5, 6.5 — 7.5, 7.5 — 8.5,
etc. If we make the size of the step interval 3, the steps
become 5.5 — 8.5, 8.5 — 11.5, etc.

Here are some scores from the Ayres Handwriting Scale:
30, 40, 50, etc. The size of the step interval cannot be less
than 10 because the values on the Ayres scale are given in
units of 10 and because the scorer has obviously not
attempted to score between the values appearing on the scale.
Though the Ayres scale may be so scored that 30 means
30 — 40, it, being a product scale, is customarily scored in
such a way that 30 means 25 — 35. If we make the size of
the step interval 10, the steps of the frequency distribution
will appear as follows: 25 — 35, 35 — 45, 45 — 55, etc.
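The rule for fixing step limits can be sketched as below. The function and its parameter names are hypothetical: `score_is_midpoint` is False for a performance test (the score marks the beginning of its step) and True for a product scale (the score marks the middle of its step), and `scale_unit` is the distance between adjoining values on the scale.

```python
def step_limits(first_score, step, n_steps, score_is_midpoint, scale_unit=1):
    """Return the first n_steps step intervals as (lower, upper) pairs.

    The beginning point comes from the meaning of the score; the ending
    point of each step comes from the chosen size of the step interval."""
    if score_is_midpoint:
        start = first_score - scale_unit / 2   # half a unit below the scale value
    else:
        start = first_score                    # the score itself begins the step
    return [(start + i * step, start + (i + 1) * step) for i in range(n_steps)]

# Performance test, scores 6, 7, 8, ...: steps of 1 and of 2.
print(step_limits(6, 1, 3, score_is_midpoint=False))
print(step_limits(6, 2, 3, score_is_midpoint=False))
# Product scale, scores 6, 7, 8, ...: steps of 1 and of 3.
print(step_limits(6, 1, 3, score_is_midpoint=True))
print(step_limits(6, 3, 2, score_is_midpoint=True))
# Ayres scale, scores 30, 40, 50, ... in units of 10: steps of 10.
print(step_limits(30, 10, 3, score_is_midpoint=True, scale_unit=10))
```

Each pair is read as "from the lower limit up to but not including the upper limit".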

How shall the step limits be indicated once they are fixed?
There are four common methods of indicating step limits,
but no matter which method is employed the important
point is to remember just what the limits are, lest tabulation
and statistical errors result. The conventional methods
follow.

Where 6.0 is the lower step limit and the step interval is 1:

     I                II                III                  IV
  Midpoint        Lower Limit        Both Limits          Both Limits
  Method  Freq.   Method  Freq.      Method    Freq.      Method       Freq.
   6.5     1       6.      1         6 — 7      1         6 — 6.999     1
   7.5     3       7.      3         7 — 8      3         7 — 7.999     3
   etc.   etc.    etc.    etc.       etc.      etc.       etc.         etc.

Where 5.5 is the lower step limit and the step interval is 2:

   6.5     1       5.5     1       5.5 — 7.5    1       5.5 — 7.499     1
   8.5     3       7.5     3       7.5 — 9.5    3       7.5 — 9.499     3
   etc.   etc.    etc.    etc.       etc.      etc.       etc.         etc.

Method IV will cause fewest errors and is recommended
for the beginner. Method II is most convenient and is
recommended for use after the student has firmly fixed the
habit of thinking of a score as spread over an interval.
Method III will be used throughout this discussion, and the
reader is specially warned to remember that 6 — 7 means
6 and up to but not including 7, i. e., 6 up to 6.999999999
and so on to infinity even though it will be written for
convenience 6 — 7.

III. Order Distribution 

Comparison and Construction. — The third mass 
measure is the simplest of all, and, like the frequency
distribution, is the basis for certain types of subsequent
statistical treatment. It is not so helpful, however, as either 
the frequency surface or the frequency distribution in the 
immediate study of the condition of a class. 

If the scores in Table 30 are arranged in order of their 
size beginning with the smallest they appear as follows: 
6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, and so on up to 
20. Such an arrangement would be an order distribution 
and would obviously be more intelligible than the arrange- 
ment in Table 30. 

There is a fourth mass measure which is in some respects 
like the order distribution. This is the rank distribution. 
The construction and use of this measure will be considered 
later. Suffice it to say here that the construction of a rank 
distribution depends upon the preliminary construction of 
an order distribution. The order distribution shows not 
only what pupil made the highest score but also what was 
the actual numerical size of the score. The rank distri- 
bution neglects the numerical size of the score and merely 
states which pupil ranks highest, second, third and so on. 



CHAPTER XV 
STATISTICAL METHODS— POINT MEASURES 

Kinds and Functions of Point Measures. — Mass 
measures may be vague and statistically cumbersome, but 
they possess the virtue of including every score in the class. 
It is the function of point measures to represent the condi- 
tion of a class by a single number. The point chosen to 
represent the class depends upon the statistical method em- 
ployed. The common methods are: 

I. Mode. 

II. Mean.1

III. Median or Midscore. 

IV. Lower Quartile Point. 
V. Upper Quartile Point. 

I. Mode 

What Is the Mode? — The true mode is too difficult in
its computation for general use and will not be discussed
further, but the crude mode is so simple that its calculation
need not be specially illustrated. It is the simplest of all
point measures, and gives a sort of look-and-see class score.
The crude mode is the most frequent score. The mode of
Fig. 26 is 13, of Fig. 28 is 15, of Fig. 29 is 7, of Fig. 30 is
11 and 18 since there are two modes, and of Table 32 is 7.
Since 13 means 13 to 13.99, and similarly for the other
scores, it is probably better to say the modes are at the
mid-points, 13.5, 15.5, 7.5, etc.
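Although the crude mode needs no special computation, a short sketch shows how the look-and-see rule treats a surface with two equally high peaks. The function name is hypothetical, and the frequencies are those read from the bases of Figs. 26 and 30; note that this simple rule reports only the scores tied for the highest frequency, so a lower second peak would not be reported.

```python
def crude_modes(distribution):
    """Return every score whose frequency equals the highest frequency."""
    top = max(distribution.values())
    return [score for score, freq in distribution.items() if freq == top]

# Frequency distributions read from the bases of Fig. 26 and Fig. 30.
fig26 = dict(zip(range(6, 21), [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]))
fig30 = dict(zip(range(6, 24),
                 [1, 2, 3, 4, 5, 6, 5, 4, 2, 2, 4, 5, 6, 5, 4, 3, 2, 1]))

modes26 = crude_modes(fig26)
modes30 = crude_modes(fig30)

# Each score marks the beginning of a unit step, so the mode is better
# stated at the mid-point of that step.
midpoints26 = [m + 0.5 for m in modes26]
midpoints30 = [m + 0.5 for m in modes30]
print(modes26, midpoints26)
print(modes30, midpoints30)
```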

1 Throughout this book the term average is used in its generic sense to signify 
mode, mean, median, midscore, or any other measure of central tendency. 


II. Mean 

What Is the Mean? — A very common experience for
many students is that they forget or get confused about all
their ordinary arithmetic just as soon as they begin to study
statistics. So it is well to state at the outset that the mean
or arithmetic mean of statistics is the same
good-old-school-day average. It is the sum of the scores divided by the
number of the scores. Two differences will be noted. In
the first place, when the mean of childhood was calculated
from numbers like 12, 13, etc., 12 was 12 and wasn't
12-stretching-to-12.99, nor
12-wriggling-backward-to-11.5-and-squirming-forward-to-12.499,
or at least we weren't conscious of 12 being all this. In the
second place, statistics has developed certain shortcuts which
are more or less novel, but which are very economical
whenever the scores become numerous, even though they may
appear more cumbersome for the following simple illustrations.

How to Compute the Mean. — Illustrative problems 
are worked out in Tables 32 and 33. 

(a) Scores Ungrouped (Table 32)

(1) The scores are tabulated in an order distribution,
though this is not necessary.

(2) The sum of the scores is 156 and the number of
scores is 24.

(3) Then the mean = 156/24 + .5 = 7.0. This test
is so scored that 2 means 2 — 2.999 and hence
2 is most truly represented by its mid-point 2.5
and similarly for all the other scores. This will
make the mean .5 higher than 156/24. This can
be proved by adding up 2.5, 3.5, 4.5, etc., and
by dividing this sum by 24.
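The ungrouped computation and the proof suggested in step (3) can both be checked with a short sketch. The function name and its parameters are hypothetical; the scores are those of Table 32, and the half-step correction is applied only when the score marks the beginning of its step rather than its mid-point.

```python
def mean_with_midpoint_correction(scores, step=1, score_is_midpoint=False):
    """Sum of the scores divided by their number, plus half a step when
    each score marks the beginning (not the middle) of its step."""
    raw_mean = sum(scores) / len(scores)
    correction = 0 if score_is_midpoint else step / 2
    return raw_mean + correction

# Table 32: number of examples done correctly by 24 pupils.
table32 = [2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6,
           7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 12]

mean = mean_with_midpoint_correction(table32)

# The proof from step (3): replace each score by its mid-point first.
mean_from_midpoints = sum(s + 0.5 for s in table32) / len(table32)

print(sum(table32), len(table32), mean, mean_from_midpoints)
```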



TABLE 32

Number of Examples Done Correctly on Courtis Addition
Test Series B

Scores Ungrouped

Pupil   Score
  1       2
  2       3
  3       4
  4       4
  5       5
  6       5
  7       5
  8       5
  9       6
 10       6
 11       6
 12       6
 13       7
 14       7
 15       7
 16       7
 17       7
 18       8
 19       8
 20       8
 21       9
 22       9
 23      10
 24      12

Sum  = 156
N    = 24
Mean = 156/24 + .5 = 7.0

Scores Grouped in Step Intervals of 1

                         Deviation from
Score       Frequency    Guessed Mean     Freq. x Dev.
 2 —  3         1             -5               -5
 3 —  4         1             -4               -4
 4 —  5         2             -3               -6
 5 —  6         4             -2               -8
 6 —  7         4             -1               -4
 7 —  8         5              0                0    (sum of minus values = -27)
 8 —  9         3              1                3
 9 — 10         2              2                4
10 — 11         1              3                3
11 — 12         0              4                0
12 — 13         1              5                5    (sum of plus values = 15)

N = 24
Guessed Mean = 7.5
Correction = (15 - 27)/24 = -12/24 = -.5
Mean = 7.5 + (-.5) = 7.0



(b) Scores Grouped (Table 32)

(1) The scores are retabulated in a frequency
distribution. Previously most frequency distributions
have been shown on the horizontal, but
the vertical position is more convenient for
statistical work.

(2) The sum of the frequencies or N = 24.

(3) The mid-point of any step near the middle of the
distribution is taken as the guessed mean.
Guessed mean = 7.5. Any step may be chosen
and the mean will come out the same.

(4) The scores are turned into deviations from the
guessed mean. The step 6 — 6.99 is one step
below (-1) the guessed mean. Step 8 — 8.99
is one step above (+1) the guessed mean and
so on.

(5) Each deviation is multiplied by its corresponding
frequency. There is one deviation of -5
and one of -4. There are two deviations of -3
making a total deviation of -6 and so on.

(6) The sum of the minus deviations is -27 and
of the plus deviations +15. The net sum is
-12, which when divided by N gives the
correction -.5. This -.5 tells that the real mean
is .5 below the guessed mean.

(7) The guessed mean is corrected to give the real
mean: 7.5 + (-.5) = 7.0. The commonest errors
made in using the short method are: (a) failure
to guess the mean at the mid-point of the step
interval chosen; (b) failure to multiply the
deviations by the frequencies; (c) a tendency to
confuse the score column and the frequency
column.
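The short method of steps (1) through (7) can be sketched as follows. The function name is hypothetical; each step is represented by its mid-point and its frequency, using the grouped half of Table 32.

```python
def short_method_mean(steps, guessed_mean):
    """Mean by the short method.

    steps is a list of (midpoint, frequency) pairs. Each deviation from
    the guessed mean is multiplied by its frequency, the products are
    summed, and the net sum divided by N corrects the guessed mean."""
    n = sum(freq for _, freq in steps)
    net_deviation = sum(freq * (mid - guessed_mean) for mid, freq in steps)
    correction = net_deviation / n
    return guessed_mean + correction

# Table 32 grouped in step intervals of 1: mid-points 2.5, 3.5, ..., 12.5.
midpoints = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5]
frequencies = [1, 1, 2, 4, 4, 5, 3, 2, 1, 0, 1]
table32_steps = list(zip(midpoints, frequencies))

mean_a = short_method_mean(table32_steps, guessed_mean=7.5)
# Guessing at a different step changes the correction, not the result.
mean_b = short_method_mean(table32_steps, guessed_mean=4.5)
print(mean_a, mean_b)
```

Here the deviations are expressed in score units; with a step interval of 1 these are the same as deviations counted in steps.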

(a) Scores Ungrouped (Table 33)

(1) The scores are arranged in an order distribution.

(2) The sum of the scores is 1380 and N is 24.

(3) The mean = 1380/24 = 57.5. No correction is
added because the scores 20, 40, etc., are already
at their mid-point. The values on the Ayres
Handwriting Scale are 20, 30, 40, etc., with no
intervening values. As the scale is ordinarily
used, a pupil's penmanship specimen which falls
barely above 25 is called 30 and one barely below
35 is also called 30, hence 20 means 15 — 24.99
and 30 means 25 — 34.99. Thus the scores 20, 30,
40, etc., are, unlike the original scores in Table 32,
already at the mid-point, and the mean needs no
correction. If we imagine 20, 40, etc., to mean
instead 20 — 29.99 and 40 — 49.99, the correction
would be not .5 as in Table 32 but 5, since
the mid-point of each score would be 5 higher
than the beginning point.

TABLE 33

Quality of Penmanship as Judged with the Ayres
Handwriting Scale

Scores Ungrouped

Pupil   Score
  1      20
  2      20
  3      40
  4      40
  5      40
  6      50
  7      50
  8      50
  9      50
 10      50
 11      50
 12      60
 13      60
 14      60
 15      60
 16      60
 17      70
 18      70
 19      70
 20      70
 21      80
 22      80
 23      90
 24      90

Sum  = 1380
N    = 24
Mean = 1380/24 = 57.5

Scores Grouped in Step Intervals of 10

                         Deviation from
Score       Frequency    Guessed Mean     Freq. x Dev.
15 — 25         2            -30              -60
25 — 35         0            -20                0
35 — 45         3            -10              -30
45 — 55         6              0                0    (sum of minus values = -90)
55 — 65         5             10               50
65 — 75         4             20               80
75 — 85         2             30               60
85 — 95         2             40               80    (sum of plus values = 270)

N = 24
Guessed Mean = 50
Correction = (270 - 90)/24 = 180/24 = 7.5
Mean = 50 + 7.5 = 57.5

(b) Scores Grouped (Table 33)

(1) The scores are retabulated in a frequency
distribution. It will be observed that the step
limits are made to fit the real meaning of the
scores.

(2) N = 24. Guessed mean = 50, the mid-point of
the 45 — 54.99 step.

(3) The deviation from the guessed mean of 35 —
44.99 is -10. Note that the 25 — 34.99 step,
though its frequency is zero, is inserted into the
distribution. This must always be done.

(4) The remainder of the process is similar to that
described for Table 32.

III. Lower Quartile, Median, and Upper Quartile

What Are the Median and Quartile Points? — These
three point measures are treated together because of the
similarity of their computation. The intimate relation of the
three is shown by their definition. The lower quartile or Q1
or 25 percentile is that point below which are 25% of the
scores and above which are 75% of the scores. The median
or 50 percentile is that point below which are 50% of the
scores and above which are 50% of the scores. The upper
quartile or Q3 or 75 percentile is that point which has 75%
of the scores below it and 25% above it.

Computation of Q1, Median, and Q3. —



TABLE 34

Number of Examples Done Correctly on Courtis Addition Test Series B

Scores Ungrouped

Pupil   Score
  1       2
  2       3
  3       4
  4       4
  5       5
  6       5
  7       5
  8       5
  9       6
 10       6
 11       6
 12       6
 13       7
 14       7
 15       7
 16       7
 17       7
 18       8
 19       8
 20       8
 21       9
 22       9
 23      10
 24      12

N = 24

Scores Grouped in Step Intervals of 1

Score       Frequency
 2 —  3         1
 3 —  4         1
 4 —  5         2
 5 —  6         4
 6 —  7         4
 7 —  8         5
 8 —  9         3
 9 — 10         2
10 — 11         1
11 — 12         0
12 — 13         1

N = 24

Computation (the same for ungrouped and grouped scores)

N/4 = 6th        Q1 = 5 + 2/4 x 1        Q1 = 5.5
N/2 = 12th       Median = 7 + 0/5 x 1    Median = 7
3/4 N = 18th     Q3 = 8 + 1/3 x 1        Q3 = 8.33



(a) Scores Ungrouped (Table 34) — Q1

(1) The scores are arranged in an order distribution.

(2) N = 24. How far down to count in order to
locate Q1 is shown by N/4. The 6th score then
is the lower quartile point.

(3) The 6th score is 5. But since 5 means 5 — 5.99,
just where between 5 and 5.99 is the Q1? There
are 4 scores spread between 5 — 5.99. Two of
these 4 were used up in counting down 6 scores,
so the best guess is that Q1 is 5 plus 2/4 of the
distance 5 to 5.99. Q1 = 5.5.

(b) Scores Ungrouped — Median

(1) N/2 = 12. The 12th score down is the median.

(2) The 12th score uses up 0 of the five scores of 7,
hence the median is 7 plus 0/5 of the distance
from 7 to 7.99. Median = 7.0.

(c) Scores Ungrouped — Q3

Three-fourths of N = 18. The 18th score is
the Q3. The 18th score uses up one of the three
8's. Therefore Q3 = 8 + 1/3 x 1 = 8.33.

(d) Scores Grouped — Q1

(1) The scores are retabulated in a frequency
distribution.

(2) N/4 = 6. Hence the 6th score is the Q1. The
6th score uses up frequencies, viz: 1 + 1 + 2
and 2 of the four 5 — 5.99's. Hence Q1 = 5 +
2/4 x 1 = 5.5. Thus the process is identical with
that given for scores ungrouped.

(e) Scores Grouped — Median

The process is identical with that given for
scores ungrouped.

(f) Scores Grouped — Q3

The process is identical with that given for
scores ungrouped.

The methods of calculating these three measures are
practically identical whether the scores are grouped or
ungrouped. When scores are grouped the counting down to
locate the three point measures is simplified, as is also the
determination of the size of the correction. In locating the
Q1, for example, the 6th score uses up two of the four scores.
The number 2 is placed as a numerator over the number 4,
thus, 2/4, and we have the correction as soon as the fraction
is multiplied by the step interval. Multiplying by a step
interval of 1 does not affect the correction, but it is well to
establish the habit. Table 35 shows its necessity.
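The counting-down procedure for Q1, median, and Q3 can be sketched in one function. The name `point_measure` and its parameters are hypothetical, and the sketch assumes every step has the same size and that no special rule is needed for gaps between scores.

```python
from collections import Counter

def point_measure(scores, fraction, step=1, score_is_midpoint=False):
    """Locate the point below which `fraction` of the scores lie.

    Count down the order distribution until fraction * N scores are used
    up, then add to the beginning point of the step reached the used-up
    share of that step's frequency, multiplied by the step interval."""
    target = fraction * len(scores)
    counts = Counter(scores)
    used = 0  # scores counted before the current step
    for score in sorted(counts):
        freq = counts[score]
        if used + freq >= target:
            lower = score - step / 2 if score_is_midpoint else score
            return lower + (target - used) / freq * step
        used += freq
    raise ValueError("fraction must lie between 0 and 1")

# Table 34: scores mark the beginning of each unit step.
table34 = [2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6,
           7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 12]

q1 = point_measure(table34, 0.25)
median = point_measure(table34, 0.50)
q3 = point_measure(table34, 0.75)
print(q1, median, round(q3, 2))
```

For the median the sketch stops at the end of the 6 — 7 step instead of the start of the 7 — 8 step; these are the same point, so the result agrees with the 7 + 0/5 of the text. For a product scale such as the Ayres, one would pass the step size and `score_is_midpoint=True`.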



TABLE 35

Quality of Penmanship as Judged with the Ayres Handwriting Scale

Scores Ungrouped

Pupil   Score
  1      20
  2      20
  3      40
  4      40
  5      40
  6      50
  7      50
  8      50
  9      50
 10      50
 11      50
 12      60
 13      60
 14      60
 15      60
 16      60
 17      70
 18      70
 19      70
 20      70
 21      80
 22      80
 23      90
 24      90

N = 24

Scores Grouped in Step Intervals of 10

Score       Frequency
15 — 25         2
25 — 35         0
35 — 45         3
45 — 55         6
55 — 65         5
65 — 75         4
75 — 85         2
85 — 95         2

N = 24

Computation (the same for ungrouped and grouped scores)

N/4 = 6th        Q1 = 45 + 1/6 x 10        Q1 = 46.67
N/2 = 12th       Median = 55 + 1/5 x 10    Median = 57
3/4 N = 18th     Q3 = 65 + 2/4 x 10        Q3 = 70





(a) Scores Ungrouped (Table 35) — Q1

(1) The scores are arranged in an order distribution.

(2) N = 24. N/4 = 6. The 6th score is Q1. Counting
down six scores, one of the six scores of 50
is used up. Hence the correction is 1/6
multiplied by the size of the step interval 45 — 54.99,
which is 10. The correction, 1/6 x 10, is added
to, not 50, but the beginning point of the score
of 50, which is 45.

The calculation for Q1 shows that the process is practically
identical with that given for a performance test and this is
true for the median and Q3 as well. So beyond pointing out
two slight differences, the illustration may be left to speak
for itself. In the preceding problem the step interval was 1,
in this it is 10, hence in this problem corrections are
multiplied by 10. Again, in the first problem the score was
expressed in its original form as the beginning point of the
step, while in the second it was the mid-point. The
frequency distributions in both problems make these two points
clear.

Most of the difficulties which worry beginners in computing
Q1, Median, and Q3 have now been considered. The
computation of the median in Table 36 shows how to deal 
with a slightly different situation which is a frequent source 
of difficulty. There are some scales whose values are irreg- 
ular. Such, for instance, is the Nassau County Extension 
of the Hillegas Composition Scale. Table 36 shows how 
to deal with such a situation. Note that the step limits are 
the half way points between scale values and that the size 
of the step interval fits the distances between scale values. 
In other words the step limits are made to harmonize with 
the way in which this product scale was scored. Also ob- 
serve that even though N is an odd number the regular 
computation technique is unaffected. Also observe that 






the upper limit of the last step interval cannot be given 
because the last value on the scale is 9.0. 

TABLE 36

Scores According to the Nassau County Composition Scale.
Sample Scores: 0, 1.1, 1.9, 2.8, 3.8, 5.0, etc.

Quality        Frequency
 .55-1.5           1
1.5 -2.35          2
2.35-3.3           3
3.3 -4.4           4
4.4 -5.5           5
5.5 -6.6           2
6.6 -7.6           0
7.6 -8.5           1
8.5 -              1
N                 19

Computation
N/4 = 19/4 = 4.75
Q1 = 2.35 + 1.75/3 × .95
Q1 = 2.9
N/2 = 19/2 = 9.5
Median = 3.3 + 3.5/4 × 1.1
Median = 4.26
3/4 N = 3/4 of 19 = 14.25
Q3 = 4.4 + 4.25/5 × 1.1
Q3 = 5.34
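
For a scale with irregular values, the step limits themselves can be derived before interpolating. The sketch below is an illustration only. The scale values after 5.0 are inferred from the step limits printed in Table 36 rather than quoted from the text, a frequency of 0 for the scale value 0 is assumed, and the function name is hypothetical.

```python
# Scale values of the composition scale. Those after 5.0 are inferred from
# the step limits printed in Table 36, not quoted from the text.
scale_values = [0, 1.1, 1.9, 2.8, 3.8, 5.0, 6.0, 7.2, 8.0, 9.0]
# Frequency of each scale value; a frequency of 0 for the value 0 is assumed.
freqs = [0, 1, 2, 3, 4, 5, 2, 0, 1, 1]

# Step limits are the halfway points between neighboring scale values.
limits = [(a + b) / 2 for a, b in zip(scale_values, scale_values[1:])]


def point_measure(fraction):
    """Interpolate the point below which `fraction` of the scores lie."""
    target = fraction * sum(freqs)
    below = 0
    for i, f in enumerate(freqs):
        if f > 0 and below + f >= target:
            if i == 0 or i == len(freqs) - 1:
                # the first and last steps lack a limit on one side
                raise ValueError("end steps have no limit on one side")
            lower, upper = limits[i - 1], limits[i]
            # the size of the step is whatever the two limits make it
            return lower + (target - below) / f * (upper - lower)
        below += f
    raise ValueError("fraction must lie between 0 and 1")


q1 = point_measure(0.25)
median = point_measure(0.50)
q3 = point_measure(0.75)
print(round(q1, 2), round(median, 2), round(q3, 2))
```

Because each step takes its size from its own limits, the correction for Q1 is multiplied by .95 and the corrections for the median and Q3 by 1.1, as in the table.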



Another common source of difficulty is gaps in the scores,
or rather, steps whose frequency is zero. Table 37 illustrates
what to do when this difficulty is met. Note the procedure
in computing Q1. N/4 = 2. Counting down the
"Frequency" column 2 carries us through the step 2-4
but it doesn't carry us into the step 6-8. In other words
Q1 is between 4 and 6. When there is such a gap the best
policy is to divide it equally between the step interval below
and the step interval above. This has the effect of making
the lower step interval 2-5 and the upper step interval
5-8. Consequently the correction is added to 5, the beginning
point of the new step, and the correction is multiplied
by 3, the size of the new step interval. Similarly in
computing the median the beginning point of the step becomes
10, and the size of the step interval becomes 4. In
computing Q3, these numbers become 17 and 5 respectively.

TABLE 37

Scores According to the Monroe Standardized Fundamentals of
Arithmetic Test Grouped in Step Intervals of 2

Score     Frequency
 0- 2         1
 2- 4         1
 4- 6         0
 6- 8         2
 8-10         0
10-12         0
12-14         2
14-16         0
16-18         0
18-20         0
20-22         2
N             8

Computation
N/4 = 8/4 = 2
Q1 = 5 + 0/2 × 3
Q1 = 5
N/2 = 8/2 = 4
Median = 10 + 0/2 × 4
Median = 10
3/4 N = 3/4 of 8 = 6
Q3 = 17 + 0/2 × 5
Q3 = 17
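
The treatment of gaps can be sketched as follows. This is a hedged illustration, not a definitive routine: the function name is hypothetical, and it follows the worked figures of Table 37 in moving only the beginning point of the step that lies above a gap.

```python
def point_measure_with_gaps(lower_limits, step, freqs, fraction):
    """Interpolated point measure in which an empty gap below an occupied
    step is divided equally between the steps on either side of it."""
    target = fraction * sum(freqs)
    below = 0            # scores counted so far
    prev_upper = None    # upper limit of the last step that held any scores
    for lower, f in zip(lower_limits, freqs):
        if f == 0:
            continue
        upper = lower + step
        if prev_upper is not None and prev_upper < lower:
            # a gap lies just below this step: begin the step at its middle
            lower = (prev_upper + lower) / 2
        if below + f > target:
            # the correction is multiplied by the size of the new step
            return lower + (target - below) / f * (upper - lower)
        below += f
        prev_upper = upper
    return prev_upper    # fraction of 1: the top of the last occupied step


# Table 37: steps of 2 beginning at 0, 2, ..., 20
lower_limits = list(range(0, 22, 2))
freqs = [1, 1, 0, 2, 0, 0, 2, 0, 0, 0, 2]

q1 = point_measure_with_gaps(lower_limits, 2, freqs, 0.25)
median = point_measure_with_gaps(lower_limits, 2, freqs, 0.50)
q3 = point_measure_with_gaps(lower_limits, 2, freqs, 0.75)
print(q1, median, q3)
```

The strict comparison is a design choice: when the count ends exactly at the top of a step that is followed by a gap, the point is placed in the middle of the gap rather than at the top of the step, which is what the table does.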



How to Compute the Midscore. — A small class of
nine pupils made the following scores in an arithmetic test:

3, 9, 8, 7, 11, 12, 6, 3, 13.

The median of this series of scores, according to the computation
method just described, is 8.5. The midscore on the
other hand is 8.

The first step in computing a midscore is to arrange the
scores in order of size. The above when so arranged are:

3, 3, 6, 7, 8, 9, 11, 12, 13.

The second step is to divide the number-of-scores-plus-one
by two, thus:

(9 + 1) ÷ 2 = 5.

This tells us that the fifth score, counting from 3 toward 13
or from 13 toward 3, is the midscore. This gives a midscore
of 8. When there is no single middle score, as when the number
of scores is even, the midscore is usually
considered as the mean of the two middlemost scores.
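
The two steps can be sketched in Python as below. The function name midscore is hypothetical, and the even-numbered case follows the convention of averaging the two middlemost scores.

```python
def midscore(scores):
    """Middle score of the ordered series, or the mean of the two
    middlemost scores when the number of scores is even."""
    ordered = sorted(scores)              # first step: arrange in order of size
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2           # second step: (N + 1) / 2
        return ordered[position - 1]      # positions count from 1
    lower_middle = ordered[n // 2 - 1]
    upper_middle = ordered[n // 2]
    return (lower_middle + upper_middle) / 2


scores = [3, 9, 8, 7, 11, 12, 6, 3, 13]
print(midscore(scores))
```

For the nine scores above the function returns the fifth score in order, 8.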

When to Use Each Average.— Use the mode when 
(a) quick computation is essential, or (b) the most frequent 
score is desired. 

Use the mean when (a) every score should have an influ- 
ence in determining the average which is exactly propor- 
tionate to the score's amount, or (b) when the lowest unrelia- 
bility is sought, or (c) when subsequent correlation or other 
formulae or procedures require the mean. 

Use the median when (a) quick computation is fairly im- 
portant, or (b) a more popular average than the mean is 
desired, or (c) it is important that extreme or erroneous 
scores should not markedly influence the average, or (d) it 
is desired that certain scores exercise an influence in deter- 
mining the average when all that is known concerning these 
scores is that they are above or below the average. 

Use the midscore when (a) a very simple method of com- 
putation is required, or (b) the scores are discrete rather 
than continuous, i. e., they do not measure an infinitely con- 
tinuous or divisible fact such as adding ability but a discrete 
indivisible fact such as the number of pupils and the like.

Usually the mean or median should be used. 



CHAPTER XVI 

STATISTICAL METHODS— VARIABILITY 
MEASURES 

Need for Variability Measures. — The invariable fact 
in educational measurement is that pupils vary. Such vari- 
ation, dispersion, or spread is a significant determiner of 
educational procedure. A measure of central tendency, 
while important, does not give a complete description of the 
condition of a class. Two classes when measured for 
spelling ability might show identical central tendencies with 
one varying from second-grade to eighth-grade ability and 
the other scarcely varying at all. 

Nature of Variability Measures. — If we imagine the 
scores of a class to be tabulated in a frequency surface, the 
median or mean of this class is a mid-point and may be 
thought of as a point at or near the middle of the base line 
of the frequency surface. A variability measure on the 
other hand is not a point but a distance, just in the same way 
as an inch is a distance. The inch is a constant distance, 
but the variability measure is a variable distance, varying 
with the distribution for which it is calculated. A variabil- 
ity measure may be thought of as a certain distance along 
any part of the base line of a frequency surface. It is, 
however, most commonly thought of as being a certain dis- 
tance just above or just below the central tendency. 

Types of Variability Measures. — The mass measures
— frequency surface, frequency distribution, and order dis- 
tribution — already described give a far clearer picture of the 
variation within a class than do any other measures. If, 
however, it is desired to make use of variability in situations 


where a single numerical value for it is required, one of the
following conventional measures should be calculated.

I. Total Range. The total range distance includes
100% of the scores.

II. Quartile Deviation (Q) or Semi-Interquartile Range.
A distance of Q above and a distance of Q below the central
tendency includes roughly the middle 50% of the scores.

III. Mean Deviation (Mn.D. or A.D.). A distance
of mean deviation above and below the central tendency
includes roughly 57.5% of the scores.

IV. Standard Deviation (S.D.) or Mean Square Deviation
or Sigma (σ). A distance of S.D. above and below the
central tendency includes roughly 68% of the scores.

Neither the Median Deviation (M.D.) nor the Probable 
Error (P. E.) is included in the above list, nor will their 
calculation be illustrated at this point. The probable error 
will be considered later. For all practical purposes they 
may be considered equal to Q. In a normal distribution 
they are exactly so. It is better to reserve P.E. exclusively 
for use as an unreliability measure. The computation of 
the M.D. is, however, very simple. Just as a mean devia- 
tion is a mean of deviations regardless of signs, so an M.D. 
is a median of deviations without regard to signs. 

Transmutation of One Variability Measure into 
Another. — The student will rarely need to compute 
more than one measure of variability, but if he does, and the 
distribution of scores is normal, all the other measures can 
be gotten from just one by means of the following. If the 
distribution is only approximately normal these relationships 
hold only approximately. 

Q or M.D. or P.E. = .6745 S.D.
Mn.D. = .7979 S.D.
Q or M.D. or P.E. = .8453 Mn.D.
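
As an illustration, the three relationships can be written as constants and applied to a single computed S.D. The names below are hypothetical, the value 2.22 is the S.D. found later in this chapter for the addition scores, and the results hold only to the extent that the distribution is normal.

```python
# Ratios for a normal distribution, as listed above.
Q_PER_SD = 0.6745
MND_PER_SD = 0.7979
Q_PER_MND = 0.8453


def from_sd(sd):
    """Estimate Q and Mn.D. from an S.D., assuming a normal distribution."""
    return {"Q": Q_PER_SD * sd, "Mn.D.": MND_PER_SD * sd}


estimates = from_sd(2.22)
print(round(estimates["Q"], 3), round(estimates["Mn.D."], 3))
```

The third ratio is simply the quotient of the first two, so any one of the three measures fixes the other two.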

I. Total Range 

Nature and Computation. — The total range is, as its 
name implies, the distance from the smallest score to the 




largest score. It is computed by simply making the sub- 
traction. 

Like the mode in the simplicity of its computation, it is 
also like the mode in that it is valuable as an inspection 
measure only. This is so because it is peculiarly liable to 
large fluctuations, depending as it does upon but two scores. 

II. Quartile Deviation

How to Compute Q. — Because of the identity of
Q with M.D. or P.E. in a normal distribution; because of
its satisfactory approximation to them in distributions which
are not exactly normal; because ± Q above and below
central tendency includes an easily understood middle 50%
of scores, and because of the very great ease of its computation,
the quartile deviation has become very popular.

The ease of its computation is shown by the following
formula:

$$Q = \frac{Q_3 - Q_1}{2}$$

Thus it is half the distance from the lower quartile point to
the upper quartile point. The scores in Table 34 yield a Q3
of 8.33 and a Q1 of 5.5. Then for this class

$$Q = \frac{8.33 - 5.5}{2} = 1.41+$$

For Table 35

$$Q = \frac{70 - 46.67}{2} = 11.66+$$

Thus the Q for each distribution is expressed in terms of its
own distribution.
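
A minimal sketch of the same computation, with a hypothetical function name and the quartile points quoted above:

```python
def quartile_deviation(q1, q3):
    """Half the distance from the lower to the upper quartile point."""
    return (q3 - q1) / 2


q_table_34 = quartile_deviation(5.5, 8.33)
q_table_35 = quartile_deviation(46.67, 70)
print(round(q_table_34, 3), round(q_table_35, 3))
```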

III. Mean Deviation 

How to Compute the Mn.D. — The Mn.D. is a mean
of the deviations from any measure of central tendency, no
account being taken of signs. Table 38 and Table 39 illustrate
the calculation of Mn.D. from the median. The
Mn.D., however, is more frequently computed from the
mean.

TABLE 38

Number of Examples Done Correctly on Courtis Addition Test, Series B

Scores Ungrouped

Pupil   Score   Deviation from Median
  1       2          -4.5
  2       3          -3.5
  3       4          -2.5
  4       4          -2.5
  5       5          -1.5
  6       5          -1.5
  7       5          -1.5
  8       5          -1.5
  9       6           -.5
 10       6           -.5
 11       6           -.5
 12       6           -.5
 13       7            .5
 14       7            .5
 15       7            .5
 16       7            .5
 17       7            .5
 18       8           1.5
 19       8           1.5
 20       8           1.5
 21       9           2.5
 22       9           2.5
 23      10           3.5
 24      12           5.5

N = 24        Sum = 42.0
Median = 7
Mn.D. = 42/24 = 1.75

Scores Grouped in Step Intervals of 1

Score    Frequency   Deviation from Median   Frequency Times Deviation
 2- 3        1              -4.5                      -4.5
 3- 4        1              -3.5                      -3.5
 4- 5        2              -2.5                      -5.0
 5- 6        4              -1.5                      -6.0
 6- 7        4               -.5                      -2.0
 7- 8        5                .5                       2.5
 8- 9        3               1.5                       4.5
 9-10        2               2.5                       5.0
10-11        1               3.5                       3.5
11-12        0               4.5                       0.0
12-13        1               5.5                       5.5

N = 24        Sum = 42.0
Median = 7
Mn.D. = 42/24 = 1.75





(a) Scores Ungrouped (Table 38)

(1) The scores are arranged in an order distribution,
though this is not necessary.

(2) N = 24, and the median according to previous
calculation equals 7.

(3) The scores are expressed as deviations from the
median. The first score of 2, meaning as it does
2-2.99, is best represented by its mid-point
2.5. The first score deviates from the median by
4.5, the second by 3.5 and so on. The minus
signs indicate deviations downward, but since
the Mn.D. disregards signs they are not really
needed.

(4) The sum of the deviations regardless of signs
is 42.

(5) Mn.D. equals the sum of the deviations divided
by N. Mn.D. = 42/24 = 1.75.

(b) Scores Grouped

(1) The scores are retabulated in a frequency distribution.

(2) The deviation of the first step, 2-2.99, is 4.5,
of the second step, 3.5 and so on. These deviations
are not step deviations but actual deviations.

(3) The deviations are multiplied by their corresponding
frequencies. There is one deviation
of 4.5, one of 3.5, two of 2.5 which means a total
deviation of 5.0, etc.

(4) The sum of the deviations regardless of signs
is 42.

(5) Mn.D. = 42/24 = 1.75.

The student will find Table 39 self-explanatory, for there 
is no difference from the method described above, except 
that the ungrouped scores are already at their mid-point. 
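
The grouped method for both tables can be sketched as follows. The function name is hypothetical. The scores of Table 38 are moved to their mid-points, while those of Table 39 are used as they stand.

```python
def mean_deviation(midpoints, freqs, center):
    """Mean of the deviations from `center`, no account taken of signs."""
    n = sum(freqs)
    total = sum(f * abs(m - center) for m, f in zip(midpoints, freqs))
    return total / n


# Table 38: a score of 2 means 2-2.99, mid-point 2.5, and so on; median = 7
mids_38 = [score + 0.5 for score in range(2, 13)]
freqs_38 = [1, 1, 2, 4, 4, 5, 3, 2, 1, 0, 1]
mnd_38 = mean_deviation(mids_38, freqs_38, 7)

# Table 39: scores are already at their mid-points; median = 57
mids_39 = [20, 30, 40, 50, 60, 70, 80, 90]
freqs_39 = [2, 0, 3, 6, 5, 4, 2, 2]
mnd_39 = mean_deviation(mids_39, freqs_39, 57)

print(mnd_38, round(mnd_39, 3))
```

Passing the mean instead of the median as the center gives the Mn.D. from the mean without any other change.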

Mn.D. vs. Q. — It was pointed out a few pages back
that ± Q includes 50% of the scores and ± Mn.D. includes
57.5% of the scores. If this is so, the Mn.D. should be
larger than Q for the same distribution. The illustrative
problems reveal just such a relationship.



TABLE 39

Quality of Penmanship as Judged by the Ayres Handwriting Scale

Scores Ungrouped

Pupil   Score   Deviation from Median
  1      20          -37
  2      20          -37
  3      40          -17
  4      40          -17
  5      40          -17
  6      50           -7
  7      50           -7
  8      50           -7
  9      50           -7
 10      50           -7
 11      50           -7
 12      60            3
 13      60            3
 14      60            3
 15      60            3
 16      60            3
 17      70           13
 18      70           13
 19      70           13
 20      70           13
 21      80           23
 22      80           23
 23      90           33
 24      90           33

N = 24        Sum = 346
Median = 57
Mn.D. = 346/24 = 14.416+

Scores Grouped in Step Intervals of 10

Score      Frequency   Deviation from Median   Frequency Times Deviation
15-24.9        2              -37                      -74
25-34.9        0              -27                       00
35-44.9        3              -17                      -51
45-54.9        6               -7                      -42
55-64.9        5                3                       15
65-74.9        4               13                       52
75-84.9        2               23                       46
85-94.9        2               33                       66

N = 24        Sum = 346
Median = 57
Mn.D. = 346/24 = 14.416+



Problem I.     Q = 1.41.     Mn.D. = 1.75
Problem II.    Q = 11.66.    Mn.D. = 14.42



IV. Standard Deviation

How to Compute the S.D. — Like the Mn.D., the S.D.
may be computed from any average or measure of central
tendency whether mode, median, or mean. Tables 40 and
41 illustrate its calculation from the mean.



TABLE 40

Number of Examples Done Correctly on Courtis Addition Test, Series B

Scores Ungrouped

Pupil   Score   Deviation from Guessed Mean   Deviation Squared
  1       2              -5                         25
  2       3              -4                         16
  3       4              -3                          9
  4       4              -3                          9
  5       5              -2                          4
  6       5              -2                          4
  7       5              -2                          4
  8       5              -2                          4
  9       6              -1                          1
 10       6              -1                          1
 11       6              -1                          1
 12       6              -1                          1
 13       7               0                          0
 14       7               0                          0
 15       7               0                          0
 16       7               0                          0
 17       7               0                          0
 18       8               1                          1
 19       8               1                          1
 20       8               1                          1
 21       9               2                          4
 22       9               2                          4
 23      10               3                          9
 24      12               5                         25

N = 24        Sum = 124
Mean = 7.0
Guessed Mean = 7.5

$$\text{S.D.} = \sqrt{\frac{124}{24} - (7.5 - 7.0)^2} = 2.217+$$

Scores Grouped in Step Intervals of 1

Score    Frequency   Deviation from Guessed Mean   Frequency Times Deviation²
 2- 3        1                -5                            25
 3- 4        1                -4                            16
 4- 5        2                -3                            18
 5- 6        4                -2                            16
 6- 7        4                -1                             4
 7- 8        5                 0                             0
 8- 9        3                 1                             3
 9-10        2                 2                             8
10-11        1                 3                             9
11-12        0                 4                             0
12-13        1                 5                            25

N = 24        Sum = 124
Mean = 7.0
Guessed Mean = 7.5

$$\text{S.D.} = \sqrt{\frac{124}{24} - (7.5 - 7.0)^2} = 2.217+$$



(a) Scores Ungrouped (Table 40)

(1) The scores are arranged in an order distribution,
though this is not necessary.

(2) N = 24, and according to previous calculation
the mean = 7.0. In order to avoid decimals in
the deviations a guessed mean of 7.5 is used
instead of the mean 7.0. A guessed mean of 9.5
or 2.5 would serve just as well.

(3) Each score is expressed as a deviation from the
guessed mean.

(4) Each deviation is squared.

(5) The sum of the squared deviations is 124.

(6) The S.D. equals the square root of the following
quantity: the sum of the squared deviations
divided by N, less the square of the correction.
The correction is the difference between the
mean and the guessed mean, in this case .5.

(b) Scores Grouped

(1) The scores are retabulated in a frequency distribution.

(2) N = 24 and the mean = 7.

(3) The mid-point of any step near the middle of
the distribution is taken as the point of reference.
This guessed mean is always guessed at
the mid-point of the step chosen. Guessed
mean = 7.5.

(4) Each step is expressed as an actual deviation
from the guessed mean.

(5) Each deviation is squared and then multiplied
by its corresponding frequency. Beginning at
the top, this gives results, viz: (5)² × 1 = 25,
(4)² × 1 = 16, (3)² × 2 = 18, etc.

(6) The sum of the squared deviations is 124.

(7) The S.D. is given by

$$\text{S.D.} = \sqrt{\frac{124}{24} - (\text{correction})^2}$$

The correction is the difference between the guessed
mean and the actual mean, in this case .5. In
case there is no difference the correction is zero.
The advantage of computing deviations from
the guessed mean and then correcting, instead
of from the actual mean, is that the deviations
can always be kept in whole numbers.






After the explanation given above the sample problems 
worked out below need not be described. Note that the 
answer is the same whether 50 or 60 is taken as the guessed 
mean. 
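
That the guessed mean does not affect the answer can be checked with a short sketch. The function name is hypothetical, and the correction is computed here as the mean of the deviations from the guessed mean, which equals the difference between the actual mean and the guessed mean.

```python
import math


def sd_from_guessed_mean(midpoints, freqs, guessed_mean):
    """S.D. from deviations about a guessed mean, with the correction applied."""
    n = sum(freqs)
    devs = [m - guessed_mean for m in midpoints]
    sum_sq = sum(f * d * d for d, f in zip(devs, freqs))
    # mean of the deviations = actual mean minus guessed mean
    correction = sum(f * d for d, f in zip(devs, freqs)) / n
    return math.sqrt(sum_sq / n - correction ** 2)


# Table 40: addition scores at their mid-points, guessed mean 7.5
mids_40 = [score + 0.5 for score in range(2, 13)]
freqs_40 = [1, 1, 2, 4, 4, 5, 3, 2, 1, 0, 1]
sd_40 = sd_from_guessed_mean(mids_40, freqs_40, 7.5)

# Table 41: penmanship scores, guessed mean 50 and then 60
mids_41 = [20, 30, 40, 50, 60, 70, 80, 90]
freqs_41 = [2, 0, 3, 6, 5, 4, 2, 2]
sd_41_from_50 = sd_from_guessed_mean(mids_41, freqs_41, 50)
sd_41_from_60 = sd_from_guessed_mean(mids_41, freqs_41, 60)

print(round(sd_40, 3), round(sd_41_from_50, 3), round(sd_41_from_60, 3))
```

Since the correction is squared, its sign is of no consequence, and any convenient mid-point may be guessed.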

TABLE 41

Quality of Penmanship as Judged with the Ayres Handwriting Scale

Scores Ungrouped

Pupil   Score   Deviation from Guessed Mean   Deviation Squared
  1      20             -30                        900
  2      20             -30                        900
  3      40             -10                        100
  4      40             -10                        100
  5      40             -10                        100
  6      50               0                          0
  7      50               0                          0
  8      50               0                          0
  9      50               0                          0
 10      50               0                          0
 11      50               0                          0
 12      60              10                        100
 13      60              10                        100
 14      60              10                        100
 15      60              10                        100
 16      60              10                        100
 17      70              20                        400
 18      70              20                        400
 19      70              20                        400
 20      70              20                        400
 21      80              30                        900
 22      80              30                        900
 23      90              40                       1600
 24      90              40                       1600

N = 24        Sum = 9200
Mean = 57.5
Guessed Mean = 50

$$\text{S.D.} = \sqrt{\frac{9200}{24} - (57.5 - 50)^2} = 18.085$$

Scores Grouped in Step Intervals of 10

Score    Frequency   Deviation from Guessed Mean   Frequency Times Deviation²
15-25        2               -40                          3200
25-35        0               -30                             0
35-45        3               -20                          1200
45-55        6               -10                           600
55-65        5                 0                            00
65-75        4                10                           400
75-85        2                20                           800
85-95        2                30                          1800

N = 24        Sum = 8000
Mean = 57.5
Guessed Mean = 60

$$\text{S.D.} = \sqrt{\frac{8000}{24} - (60 - 57.5)^2} = 18.085$$



The S.D. from a Median. — Because the S.D. may be
computed from any average there is a popular supposition
that the S.D. from a mean and a median are computed in
exactly the same way. But deviations cannot be computed
from a guessed median as they can be from a guessed mean,
and corrected for in the formula

$$\text{S.D.} = \sqrt{\frac{\text{Sum of (dev.)}^2}{N} - (\text{correction})^2}$$

This formula holds for the mean only. A frequency distribution
can be used but all deviations must be from the actual
median, in which case the formula is identical with the
above with the correction omitted.

S.D. vs. Mn.D. or Q. — The per cents of the scores included
by ± Q, ± Mn.D., and ± S.D. are respectively,
for a normal distribution, 50%, 57.5%, and 68%. Consequently
there should be an increase in the sizes of the respective
variability measures. The facts are as follows:

Problem I.     Q = 1.41.     Mn.D. = 1.75      S.D. = 2.22
Problem II.    Q = 11.66.    Mn.D. = 14.42     S.D. = 18.085

The transmutation table given a few pages back shows
that when the frequency distribution is normal Q = .6745
S.D., Mn.D. = .7979 S.D., and Q = .8453 Mn.D. Had
S.D. only been computed and transmuted into Mn.D. and
Q, in Problem I, Mn.D. would have been 2.22 × .7979, or
1.77, instead of 1.75, and Q would have been 2.22 × .6745,
or 1.497, instead of 1.41. The transmutation formulae do
not yield exactly the same results as calculation because we
are not dealing with perfectly normal frequency distributions.

When to Use Each Variability Measure. — Use Total 
Range (a) for inspection purposes only, or (b) as a sup- 
plement to other variability measures. 

Use Q (a) when an easily and quickly computed measure
which is reasonably satisfactory is desired, or (b) when Q1
and Q3 will be important supplementary information.

Use S.D. (a) when it is desired to allow extreme scores 
to markedly influence the variability measure, or (b) when 
low unreliability is desired, or (c) when subsequent correla- 
tion or reliability formulae require the S.D. 

The Q and S.D. will suffice for practically every situation. 
There is little justification for continuing the use of either 
Mn.D. or M.D., and P.E. should be definitely reserved as 
a measure of reliability rather than variability. 



CHAPTER XVII 

STATISTICAL METHODS— RELATIONSHIP AND 
RELIABILITY MEASURES 

I. Relationship Measures 

What Is Correlation? — The idea of correlation is so
familiar that it is found in literary masterpieces and in the
fables of the street. This is especially the case with inverse
or negative correlation. "For every grain of wit there is a
grain of folly." "The vulnerable heel of Achilles." "The
leaf spot of Siegfried." "Beauty vs. Brains." "Eye-minded
vs. ear-minded." "Idea thinkers vs. thing thinkers."

Thus correlation is a method for determining the correspondence
and proportionality between two series of scores
or measures for the same pupils, or the same schools, or the
same cities, or any other entity. When the correspondence
is perfect and positive the coefficient of correlation (r) is
+1.0, when it is perfect, but negative, r is -1.0. Correlation
is positive when one series of scores tends to increase
as the other increases, and negative when one tends to increase
as the other decreases. A coefficient of correlation
may be any size from +1.0 through 0 to -1.0.

         Test I  Test II    Test I  Test III   Test I  Test IV    Test I  Test V
Pupil    Score   Score      Score   Score      Score   Score      Score   Score
A          2       6          2      12          2       6          2      12
B          3       8          3      10          3      10          3       8
C          4      10          4       8          4       8          4      10
D          5      12          5       6          5      12          5       6
          r = +1.0           r = -1.0           r = +.8            r = -.8

Some Uses of Correlation. — Here are some of the 
questions which education often asks and correlation can 


answer: How reliable is this mental or educational test? 
Does increasing its length or repeating it increase its relia- 
bility? Do these two tests measure the same aspect of 
reading ability, as they claim? Which one of a group of 
tests is most representative of all of them? Is there any 
justification for the popular assumption that pupils who are 
best in English tend to be poor in mathematics? Do those 
who work most rapidly in arithmetic tend to work most 
accurately? How reliable is a teacher's examination in 
history? How close is the agreement between a test and a 
teacher's judgment? How close is the agreement between 
school marks and success in life? These and hundreds of 
other such questions involving a relationship between two 
series of measures can be answered by correlation. 

Here are a few statements that correlation cannot make: 
When correlation is .8, 80% of the pupils show perfect cor- 
respondence. When correlation is positive but less than per- 
fect a larger score in one series always accompanies a larger 
score in the other series. When there is a high correlation 
between two series of facts one has caused the other, or cor- 
relation implies causal relation. 

How to Compute Correlation by the Standard
Method. — There are several excellent methods for computing
a coefficient of correlation. The product-moment
method is the one most commonly used and generally approved.
The product-moment formula for calculating a
coefficient of correlation is,

$$r = \frac{\Sigma xy}{N \sigma_x \sigma_y}$$

which may be stated in this form,

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \cdot \Sigma y^2}}$$

This formula is made clear by the simple problem in Table
42 with its explanation.



TABLE 42

Correlation Between the Scores of a Class on Courtis Addition
Test, Series B, and Ayres Handwriting Scale (Standard
Method)

          Score       Dev. from Mean
Pupil    I    II        x       y        x²        y²        xy
A        2    50       -5     -7.5       25      56.25      37.5
B        3    50       -4     -7.5       16      56.25      30.0
C        4    50       -3     -7.5        9      56.25      22.5
D        4    80       -3     22.5        9     506.25     -67.5
E        5    20       -2    -37.5        4    1406.25      75.0
F        5    60       -2      2.5        4       6.25      -5.0
G        5    40       -2    -17.5        4     306.25      34.0
H        5    50       -2     -7.5        4      56.25      15.0
I        6    70       -1     12.5        1     156.25     -12.5
J        6    40       -1    -17.5        1     306.25      17.5
K        6    70       -1     12.5        1     156.25     -12.5
L        6    50       -1     -7.5        1      56.25       7.5
M        7    50        0     -7.5        0      56.25       0.0
N        7    70        0     12.5        0     156.25       0.0
O        7    40        0    -17.5        0     306.25       0.0
P        7    70        0     12.5        0     156.25       0.0
Q        7    60        0      2.5        0       6.25       0.0
R        8    20        1    -37.5        1    1406.25     -37.5
S        8    60        1      2.5        1       6.25       2.5
T        8    90        1     32.5        1    1056.25      32.5
U        9    80        2     22.5        4     506.25      45.0
V        9    60        2      2.5        4       6.25       5.0
W       10    90        3     32.5        9    1056.25      97.5
X       12    60        5      2.5       25       6.25      12.5

Mean     7.   57.5     Sum or Σ         124    7850.00     434.0
                                                          -135.0
                                                           299.0

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \cdot \Sigma y^2}} = \frac{299}{\sqrt{(124)(7850)}} = \frac{299}{986.6} = .303$$

(1) The scores for tests I and II are paired according to the pupils
making them.

(2) By previous calculation the mean of test I is 7.0 and of test II
is 57.5. Theoretically only the mean may be used in product-moment
correlation, but in practice the median is sometimes used.

(3) The scores from tests I and II are turned into deviations from their
own mean. The first column of deviations is called x and the second y.

(4) Each x and each y is squared. The square of -5 is 25, of -7.5
is 56.25 and so on down.

(5) Each x is multiplied by its corresponding y. The product of -5
and -7.5 is +37.5 and so on down.

(6) The sum of the x²'s and the y²'s is computed.

Σx² = 124

Σy² = 7850

(7) The sum of the plus xy's = 434, and of the minus xy's = 135.
The algebraic sum of the two is determined. Σxy = 299.

(8) Substituting these values in the formula and solving, r = .303.
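
The steps of the standard method can be sketched in Python. The function name is hypothetical, and the data are the four-pupil illustrations given under "What Is Correlation?", chosen because their coefficients can be verified by hand.

```python
import math


def product_moment_r(xs, ys):
    """Product-moment r from two paired series of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    x = [v - mean_x for v in xs]        # deviations of series I from its mean
    y = [v - mean_y for v in ys]        # deviations of series II from its mean
    sum_xy = sum(a * b for a, b in zip(x, y))   # algebraic sum of the xy's
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


test_1 = [2, 3, 4, 5]                   # pupils A, B, C, D
r_with_2 = product_moment_r(test_1, [6, 8, 10, 12])
r_with_3 = product_moment_r(test_1, [12, 10, 8, 6])
r_with_4 = product_moment_r(test_1, [6, 10, 8, 12])
r_with_5 = product_moment_r(test_1, [12, 8, 10, 6])
print(r_with_2, r_with_3, round(r_with_4, 3), round(r_with_5, 3))
```

The function takes its deviations from the computed means of the raw scores, so no table of deviations needs to be prepared beforehand.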

How to Compute Correlation by the Method of
Ranks. — In a serious correlation study the student should
use the standard formula, but when time is an important
consideration and refinements are not essential, he may use
the Spearman "Footrule" formula

$$R = 1 - \frac{6 \Sigma G}{N^2 - 1}$$

and transmute R into r by means of Table 44. Pearson has
shown that the true correlation is not approximated by
Spearman's R until it is transmuted into r by Table 44,
which is constructed from Pearson's formula,

$$r = 2 \cos \frac{\pi}{3}(1 - R) - 1$$

The problem in Table 42 is recalculated by the Spearman
method of ranks in Table 43.

(1) The scores of test I are each given a relative rank. The score 2 is
ranked first or 1, score 3 is ranked 2, score 4 is ranked 3.5 because the two
scores of 4 occupy ranks 3 and 4 whose average is 3.5; score 5 is ranked
6.5 since the four scores of 5 occupy ranks 5, 6, 7, and 8, whose average
is 6.5 and so on for the other scores. The scores in test II are also ranked
in order beginning with the smallest score. There are two scores of 20
occupying ranks 1 and 2 so each 20 is ranked 1.5. The three scores of 40
occupy ranks 3, 4, and 5, whose average is 4 and so on for the remaining
scores of test II. The largest score in tests I and II might have received
the rank of 1 instead of the smallest, and in fact this is often done. Either
method of ranking is correct provided the method is uniform for both
tests.

(2) Compute the gains in rank of test II over test I. Thus we have
8.5 - 1 = 7.5; 8.5 - 2 = 6.5, etc.

(3) The sum of the gains in rank is computed. ΣG = 74.5.

(4) Substituting values in the formula and solving, R = .224. Transmuting
R by Table 44, r = .37.
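
The method of ranks can be sketched as below for the same twenty-four pupils. The function names are hypothetical, and the transmutation is computed directly from Pearson's formula instead of being read from Table 44.

```python
import math


def average_ranks(scores):
    """Rank from smallest (1) upward, giving tied scores their average rank."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1           # first rank the value occupies
        count = ordered.count(value)
        rank_of[value] = first + (count - 1) / 2   # average of the occupied ranks
    return [rank_of[s] for s in scores]


def footrule(test_1, test_2):
    """Return the sum of gains, Spearman's R, and the transmuted r."""
    n = len(test_1)
    ranks_1 = average_ranks(test_1)
    ranks_2 = average_ranks(test_2)
    # only gains in rank of test II over test I are summed; losses are ignored
    gains = sum(b - a for a, b in zip(ranks_1, ranks_2) if b > a)
    big_r = 1 - 6 * gains / (n ** 2 - 1)
    r = 2 * math.cos(math.pi / 3 * (1 - big_r)) - 1
    return gains, big_r, r


# Pupils A through X: addition scores and penmanship scores
test_1 = [2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6,
          7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 12]
test_2 = [50, 50, 50, 80, 20, 60, 40, 50, 70, 40, 70, 50,
          50, 70, 40, 70, 60, 20, 60, 90, 80, 60, 90, 60]

gains, big_r, r = footrule(test_1, test_2)
print(gains, round(big_r, 2), round(r, 2))
```

The sum of the gains comes to 74.5, R to about .22, and the transmuted r to about .37, in agreement with the worked problem to two places.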



TABLE 43

Correlation Between the Scores of a Class on Courtis Addition
Test Series B and Ayres Handwriting Scale (Rank Method)

          Score          Rank         Gain in Rank
Pupil    I    II       I       II          G
A        2    50       1       8.5        7.5
B        3    50       2       8.5        6.5
C        4    50       3.5     8.5        5.0
D        4    80       3.5    21.5       18.0
E        5    20       6.5     1.5
F        5    60       6.5    14          7.5
G        5    40       6.5     4
H        5    50       6.5     8.5        2.0
I        6    70      10.5    18.5        8.0
J        6    40      10.5     4
K        6    70      10.5    18.5        8.0
L        6    50      10.5     8.5
M        7    50      15       8.5
N        7    70      15      18.5        3.5
O        7    40      15       4
P        7    70      15      18.5        3.5
Q        7    60      15      14
R        8    20      19       1.5
S        8    60      19      14
T        8    90      19      23.5        4.5
U        9    80      21.5    21.5
V        9    60      21.5    14
W       10    90      23      23.5         .5
X       12    60      24      14

N = 24        ΣG = 74.5

$$R = 1 - \frac{6 \Sigma G}{N^2 - 1} = 1 - \frac{6(74.5)}{(24)^2 - 1} = .224$$

By Table 44, r = .37



How to Interpret a Correlation Coefficient. — Is an 
r of .30 or .37, according to the formula used, ^'high" or 
"low"? With r^s as with intelligence, or wealth, or beauty, 
the customary criterion is that of relativity. There seems 
to be a sort of rough agreem_ent among workers in this field 
that when r is 



Statistical Methods — Relationship and Reliability 393 



TABLE 44 

Transmutation of R into r according to 



TT 



r = 2 cos — ^ ( I 
3 ^ 



R) 



1. R= I 



6 S G 

A^'— I 



R 


r 


R 


r 


R 


r 


R 


r 


.00 


.000 


26 


.429 


•51 


•742 


.76 


.937 


.01 


.018 


27 


.444 


•52 


.753 


•77 


•942 


.02 


.036 


28 


.458 


•53 


•763 


•78 


.947 


.03 


.054 


29 


•472 


.54 


.772 


.79 


.952 


.04 


.071 


30 


.486 


•55 


.782 


.80 


•956 


.05 


.089 


31 


.500 


.56 


.791 


.81 


.961 


.06 


.107 


32 


•514 


•57 


.801 


.82 


.965 


.07 


.124 


2>2> 


.528 


•58 


.810 


.83 


.968 


.08 


.141 


34 


.541 


•59 


.818 


•84 


•972 


.09 


.158 


35 


.554 


.60 


.827 


.85 


•975 


.10 


.176 


36 


.567 


.61 


•836 


.86 


•979 


.11 


.192 


■Z7 


.580 


.62 


.844 


. .87 


.981 


.12 


.209 


.38 


.593 


•63 


.852 


.88 


•984 


.13 


.226 


.39 


.606 


.64 


.860 


•89 


•987 


.14 


.242 


40 


.618 


•65 


.867 


.90 


.989 


.15 


.259 


.41 


.630 


.66 


•875 


.91 


.991 


.16 


•275 


42 


.642 


.67 


.882 


.92 


•993 


.17 


.291 


43 


.654 


.68 


.889 


.93 


.995 


.18 


.307 


.44 


.666 


.69 


.896 


•94 


.996 


.19 


.323 


45 


.677 


.70 


.902 


•95 


•997 


.20 


.338 


46 


.689 


.71 


.908 


.96 


.998 


.21 


•354 


47 


.700 


.72 


•915 


•97 


•999 


.22 


•369 


48 


.711 


'72, 


.921 


.98 


.9996 


.23 


.384 


49 


.721 


•74 


.926 


•99 


.9999 


.24 


•399 


50 


•732 


•75 


•932 


I.OO 


1. 0000 


.25 


.414 















0 to ± .4 correlation is low, or

± .4 to ± .7 correlation is substantial, or

± .7 to ± 1.0 correlation is high.
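
The transmutation tabled in Table 44 can be sketched in a few lines of Python. This is only a sketch, assuming the formula printed with the table reads r = 2 cos[(π/3)(1 − R)] − 1; the function name `footrule_to_r` is illustrative and does not come from the text.

```python
import math


def footrule_to_r(R: float) -> float:
    """Convert a footrule coefficient R into r (assumed Table 44 formula)."""
    return 2.0 * math.cos(math.pi / 3.0 * (1.0 - R)) - 1.0


# Print a few entries in the same layout as the table (R, then r).
for R in (0.00, 0.25, 0.50, 0.75, 1.00):
    print(f"R = {R:.2f}   r = {footrule_to_r(R):.3f}")
```

Any R between 0 and 1 may be passed in, so values lying between the tabled steps need no interpolation.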

There is, however, a more satisfactory way to interpret
coefficients of correlation. When we have perfect correlation
between two traits it is possible to predict accurately
an individual's position in one of these traits from a knowledge
of his position in the other. As the coefficient of correlation
goes toward zero such predictions become more and
more uncertain. When the coefficient is exactly zero a prediction
has no more accuracy than a sheer guess or a purely
chance estimate. Kelley has worked out the data of Table
45. According to this table, when r = 0 the error of prediction
is 1.00, where 1.0 is defined as a sheer guess. When
r = .1 the error has been reduced to .995. The coefficient
of correlation must be about .85 before the error is half
way between a guess and perfect prediction. Slight increases
in the size of the coefficient above this point cause
a rapid decrease in the error of prediction.

TABLE 45

Shows Decreases in the Error of Prediction from 1.00 toward Zero
with Increases in r from Zero toward 1.0, Where an Error of
1.00 Is a Sheer Guess and an r of 1.00 Is Perfect Correlation

    r       Error
   .00     1.000
   .10      .995
   .20      .9798
   .30      .9539
   .40      .9165
   .50      .8660
   .60      .8000
   .70      .7141
   .80      .6000
   .85      .5268
   .90      .4359
   .95      .3122
   .97      .2431
   .99      .1411
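
The entries of Table 45 agree with the quantity sqrt(1 − r²). The text does not print the rule behind the table, so treating that expression as its source is an assumption, and the function name below is illustrative only.

```python
import math


def error_of_prediction(r: float) -> float:
    """Error of prediction for a coefficient r, assumed to be sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


# Tabulate the same r values that appear in Table 45.
for r in (0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60,
          0.70, 0.80, 0.85, 0.90, 0.95, 0.97, 0.99):
    print(f"r = {r:.2f}   error = {error_of_prediction(r):.4f}")
```

On this reading the error falls slowly at first and quickly near r = 1, which is the behavior the paragraph above describes.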



Equal in importance to the highness or lowness of an r is 
its reliability. As will be seen presently this is determined 
by the size of the r and the number of pupils. An r of .30 
from only 24 pupils is relatively very low, has an error of 
prediction of .954, and is besides very unreliable. 

When May Correlation Be Applied? — Both of the 
formulae given assume a rectilinear or straight line relation- 
ship. Fig. 33 shows how to plot the scores of Table 43 to 
show whether their relationship is rectilinear or curvilinear. 




In so far as there is any drift of the points at all, the 
drift is forward and upward roughly in a straight line direc- 
tion. The great dispersion of the points indicates low cor- 
relation. Fig. 34 shows rectilinear relationship coupled with 



Fig. 33. Shows Rectilinear Relationship with an r of .303.
(Data from Table 43.)

higher correlation. Fig. 35 shows three kinds of problems 
plotted on one pair of axes, (A) perfect positive correlation, 
(B) perfect negative correlation, (C) curvilinear relation- 
ship which cannot be treated by methods given in this book. 



Fig. 34. Shows Rectilinear Relationship with an r of .8

Self-Correlation Coefficients. — Self-correlation is the 
correlation between two duplicate tests given to the same 
pupils. Its chief function is to show whether one test is a 
sufficiently accurate measure of each pupil. Reliability is
one criterion for evaluating a test. Self-correlation is one
statistical technique whereby a test's reliability may be
determined. If the self-correlation between two duplicate
tests is 1.0, then one test is an absolutely accurate measure
of each pupil in the trait which the test measures. This ideal
is of course never attained.

How high should self-correlation be? No absolute stand- 
ard can be given that will fit every situation. Where test 
results are used to commit children to institutions or to ex- 
clude them from important social or educational opportuni- 
ties and the like, or where results are to be used for close 



Fig. 35. A: Rectilinear Relationship with an r of + 1.0. B: Rectilinear
Relationship with an r of − 1.0. C: Curvilinear Relationship.

theoretical reasoning, self-correlation should certainly be
above .9. But such a criterion is too drastic for most practical
purposes, since the range of self-correlation for most
standard tests is about .5 to about .9, while the range for
typical teachers' examinations is much lower. A criterion
of .9 or above would disqualify most educational tests and
forbid as a public nuisance a professor's examination. Clara
Chassell has found that the self-correlation of the marks
of college professors on students who were rated through
four full years is only .80! If the coefficient is not satisfactorily
high it is evidence that one of two things needs to be
done: (a) The test must be lengthened. How much it must
be lengthened can be determined by computing the new
correlation between the lengthened test and a duplicate of
it. (b) If the test is not lengthened or not lengthened
enough it must be repeated. How many times to repeat
can be determined empirically by giving a test and its duplicate
twice each and correlating the two series of averages,
and if that is not enough, by giving each test three times and
correlating averages, etc.

But this empirical process is very expensive in time, since 
twice as many tests as are needed must be given before it 
can be determined just how many are needed. The use of 
Spearman's prophecy formula will save half of this time. 

$$r_x = \frac{N r_1}{1 + (N - 1) r_1}$$

If the self-correlation of one test with a duplicate (r_1)
is .8, and the information sought is how many times (N)
the test must be given to yield a desired coefficient (r_x) of
.9, substitute as follows and solve for N:

$$.9 = \frac{N(.8)}{1 + (N - 1)(.8)}$$

N = 2.25 times

If the information sought is the r_x which would result
from giving the same test or similar tests four times, substitute
as follows and solve for r_x:

$$r_x = \frac{4(.8)}{1 + (4 - 1)(.8)} = .941$$

Suppose that r_1, or .8, were the self-correlation between the
average of two duplicate tests and the average of two other
similar tests. In that case the N required to yield a self-correlation
of .9 would be 2.25 × 2 or 4.5. The second
formula would be interpreted as follows: 4 pairs of tests or
8 duplicate tests in all will yield an r_x of .941.
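
Both uses of the prophecy formula can be sketched in Python. The function names are illustrative and not from the text; `repetitions_needed` is simply the same formula rearranged for N, namely N = r_x (1 − r_1) / [r_1 (1 − r_x)].

```python
def prophecy(r1: float, n: float) -> float:
    """Self-correlation expected when a test with self-correlation r1 is given n times."""
    return n * r1 / (1.0 + (n - 1.0) * r1)


def repetitions_needed(r1: float, rx: float) -> float:
    """Number of repetitions N needed to raise self-correlation r1 to rx."""
    # Rearranged from rx = N*r1 / (1 + (N - 1)*r1).
    return rx * (1.0 - r1) / (r1 * (1.0 - rx))


print(repetitions_needed(0.8, 0.9))   # repetitions needed to reach .9 from .8
print(round(prophecy(0.8, 4), 3))     # coefficient from four repetitions
```

The two worked cases in the text (N = 2.25 and r_x = .941) serve as a check on any implementation of the formula.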

Other Relationship Measures. — Correlation is not the 
only method for computing relationship. It is probable that 
the beginner will do better to compute relationship as follows:
(1) Express each of the two series of scores as a
deviation from its own average. (2) Divide these deviations
in each series by the S.D. of that series in order to equate
variability. (3) Find the difference between the two
equated deviations for each pupil. In doing this have regard
for signs. In case there is a perfect relationship all these
differences will be zero. Any difference larger than zero
shows the amount of displacement in terms of S.D. (4)
Make a frequency distribution of the differences. (5) Compute
the mean or median to determine the average amount
of S.D. displacement.
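
The five steps above can be sketched as follows. Two points are assumptions rather than statements from the text: the S.D. is computed by dividing by N, and the average in step 5 is taken over the sizes of the differences without regard to sign, since signed differences between two equated series always average to zero. The frequency distribution of step 4 is omitted, and the scores in the usage lines are hypothetical.

```python
from statistics import mean, median, pstdev


def sd_displacements(xs, ys):
    """Steps 1-3: signed differences between equated deviations, pupil by pupil."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)   # divide-by-N S.D. (assumption)
    return [(x - mx) / sx - (y - my) / sy for x, y in zip(xs, ys)]


def average_displacement(xs, ys, use_median=False):
    """Step 5: mean or median size of the displacement, in S.D. units."""
    sizes = [abs(d) for d in sd_displacements(xs, ys)]
    return median(sizes) if use_median else mean(sizes)


# Hypothetical scores for four pupils on two tests.
test_a = [2, 4, 6, 8]
test_b = [20, 40, 60, 80]           # same order as test_a: perfect relationship
print(average_displacement(test_a, test_b))
print(average_displacement(test_a, list(reversed(test_b))))
```

A perfect relationship gives an average displacement of zero; the larger the value, the weaker the relationship.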

II. Reliability Measures 

What Are Unreliability Measures?— Imagine that in 
a certain city there are 1000 pupils in the sixth grade. We 
wish to know the mean on an arithmetic test for the entire 
1000, but there is only time to measure 100 of them. If 
100 pupils are selected as nearly as possible at random from 
the 1000, and these 100 are measured, and the mean and 
S.D. of their scores is computed, it is possible by means of 
an unreliability formula to discover the limits within which
the true mean for the entire 1000 will fall. Similarly, by 
use of the appropriate formula, it is possible to determine 
the limits within which the true median, or the true Q or 
the true S.D. for the entire 1000 pupils will fall. 

But to better understand the unreliability formulae, 
imagine again that the 1000 pupils are measured in ten 
groups of 100 each selected as nearly at random as possible. 
This will yield ten means, no one of which will probably 
agree exactly with any other or with the mean of all ten 
which is the true mean. Just as it is possible to compute 
(a) the mean and (b) S.D. for 100 pupil scores, so it is 
possible to compute (c) the mean of the ten means, and (d) 
the S.D. of the ten means. These four measures are called 
respectively, (a) obtained mean, (b) S.D. distribution, (c) 
the true mean, and (d) S.D. mean. S.D. mean is a measure 
of the unreliability of any one of the ten means for it is a 
measure of the variability among the ten means and hence 
is an index of each one's most probable divergence from the 




true mean. S.D. mean, then, measures the unreliability not 
of the mean of the ten means for, since it is the true mean, 
it has no unreliability, but of the unreliability of any one of 
the ten means. 

To illustrate, suppose that actual measurement of the 
1000 pupils in groups of 100 showed the following: 



                                Mean Score     S.D. Distribution
  For the first   100 pupils        25                 9
    "  "  second   "    "           23                10
    "  "  third    "    "           24                12
    "  "  fourth   "    "           27                14
    "  "  fifth    "    "           25                10
    "  "  sixth    "    "           26                11
    "  "  seventh  "    "           24                13
    "  "  eighth   "    "           26                12
    "  "  ninth    "    "           25                11
    "  "  tenth    "    "           25                 8

  Then  True mean = 25.         True S.D. distribution = 11.
        S.D. mean = 1.1         S.D. S.D. = 1.7
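
The four summary figures can be recomputed from the ten means and ten S.D.'s listed above. In this sketch the S.D. is taken with N (not N − 1) in the divisor, and the true S.D. distribution is taken as the average of the ten group S.D.'s; both are assumptions, chosen because they reproduce the printed figures.

```python
from statistics import mean, pstdev

# The ten group means and ten group S.D.'s from the illustration.
group_means = [25, 23, 24, 27, 25, 26, 24, 26, 25, 25]
group_sds = [9, 10, 12, 14, 10, 11, 13, 12, 11, 8]

true_mean = mean(group_means)             # (c) mean of the ten means
sd_mean = pstdev(group_means)             # (d) S.D. of the ten means
true_sd_distribution = mean(group_sds)    # average of the ten S.D.'s (assumption)
sd_sd = pstdev(group_sds)                 # S.D. of the ten S.D.'s

print(true_mean, round(sd_mean, 1))
print(true_sd_distribution, round(sd_sd, 1))
```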



Just as S.D. mean, i.e., the reliability of any one mean
from 100 pupils, was computed from the ten means, and just
as S.D. S.D., i.e., the reliability of any one S.D. distribution,
was computed from the ten S.D. distributions, so it would
be possible to compute in similar fashion a reliability measure
for a median (S.D. median), or for a Q (S.D. Q), or for
an r (S.D. r), and so on.

The function of reliability formulae is to prophesy from
just one sampling what these S.D.'s would be if an individual
were to go through the labor of testing all the possible
samplings (in the above case 10). Thus the reliability
formula for the mean is

$$\text{S.D. mean} = \frac{\text{S.D. distribution}}{\sqrt{N}}$$

Suppose, instead of testing ten groups of 100 pupils and
actually determining empirically the S.D. mean, that just
the first 100 had been tested and the S.D. distribution computed,
and the S.D. mean determined through the reliability
formula. This would have been a great saving of time and
would have given an S.D. mean only two-tenths in error as
compared with the true S.D. mean of 1.1. Thus:

$$\text{S.D. mean} = \frac{9}{\sqrt{100}} = .9$$

Suppose, for illustration, we enquire about the unrelia- 
bility of some of the measures already calculated. One of 
the problems carried throughout recent chapters is sum- 
marized in Table 46. 

TABLE 46

Frequency Distribution of the Number of Examples Done Correctly
on Courtis Addition Test Series B, together with
Certain Statistical Measures Which Have Previously Been
Calculated

   Score      Frequency      Statistical Results
    2-3           1          Mean   = 7.0
    3-4           1          Median = 7.0
    4-5           2          Q      = 1.4
    5-6           4          S.D.   = 2.22
    6-7           4          r      = .303 (with Ayres Handwriting)
    7-8           5
    8-9           3
    9-10          2
   10-11          1
   11-12          0
   12-13          1

     N           24



Unreliability of the Mean in Table 46. — The unrelia- 
bility of mean 7.0 is shown by the following formula: 

$$\text{S.D. mean} = \frac{\text{S.D. distribution}}{\sqrt{N}}$$

$$\text{S.D. mean} = \frac{2.22}{\sqrt{24}} = .45$$




To say that the unreliability of the mean or S.D. mean is
.45 is not particularly illuminating to most people. But
unreliability can be stated more intelligibly. It is customary
to call practical certainty ± 3 S.D. measure. If we
are concerned with the mean, practical certainty is ± 3 S.D.
mean; if our concern is with the median, practical certainty
is ± 3 S.D. median; if with the S.D., it is ± 3 S.D. S.D.; and
so on.

We can be practically certain, then, that the true mean is
between 7.0 − 3 (.45) and 7.0 + 3 (.45), or, as it will be
written hereafter, 7.0 ± 3 (.45), i.e., we can be practically
certain that the true mean is somewhere between 5.65 and
8.35.

But the true mean of what falls between 5.65 and 8.35? 
It is not the true mean of the 24 pupils, because by actual 
computation we know that their true mean is 7.0. It is the 
true mean of any larger group of which the 24 pupils are an 
attempted random sampling. If these 24 pupils were a 
perfectly random sampling of some larger group the mean 
7.0 would be the exact mean for the larger group. But we 
can scarcely hope to make a perfect sampling. Let us 
assume that the 24 pupils are, so far as can be made, a ran- 
dom sampling, or chance selection from all fifth-grade pupils 
in New York City. According to the data we can feel 
assured that the true mean for all New York City fifth-grade 
pupils is somewhere between 7.0 ± 3 (.45). There is always
a possibility that the true mean is below 5.65 or above 8.35,
but the chance of this being so is exceedingly small. The
chances are in fact only 3 in 10,000. We cannot be practically
certain, but the chances are very great that the true
mean is between 7.0 ± 2 (.45). The chances are substantial
that the true mean is between 7.0 ± 1 (.45). Just
what the chances are will be shown later. The above treatment
applies to S.D. measure, i.e., to any measure of unreliability.
The formulae for the most important of these
follow.




Unreliability of the Median in Table 46. — 

$$\text{S.D. median} = \frac{1\tfrac{1}{4}\ \text{S.D. distribution}}{\sqrt{N}}$$

$$\text{S.D. median} = \frac{1\tfrac{1}{4} \times 2.22}{\sqrt{24}} = .57$$

We can be practically certain that the true median of the
group of which the 24 pupils are a random sampling is between
7.0 ± 3 (.57).

Observe that in this as well as in the previous unreliability
formula, reliability depends upon two factors: the
S.D. distribution and N. Reliability may be increased
either by decreasing the variability or by increasing the number
of pupils.

Unreliability of the Q in Table 46. — 

$$\text{S.D. } Q = \frac{1.11\ \text{S.D. distribution}}{\sqrt{2N}}$$

$$\text{S.D. } Q = \frac{1.11 \times 2.22}{\sqrt{2 \times 24}} = .36$$

We can be practically certain that the true Q is between
1.4 ± 3 (.36).

Unreliability of the S.D. in Table 46. — 

$$\text{S.D. S.D.} = \frac{\text{S.D. distribution}}{\sqrt{2N}}$$

$$\text{S.D. S.D.} = \frac{2.22}{\sqrt{2 \times 24}} = .32$$

We can be practically certain that the true S.D. is between
2.22 ± 3 (.32).

Unreliability of the r in Table 46. — 

$$\text{S.D. } r = \frac{1 - r^2}{\sqrt{N}}$$

With N = 24, S.D. r = .13.

The true r is almost certainly between .303 ± 3 (.13).

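
The unreliability formulae for the mean, median, Q, and S.D. can be gathered into one short Python sketch and checked against the figures for Table 46. The function names are illustrative, the multiplier 1.25 stands for the one-and-a-quarter of the median formula, and the sketch covers these four measures only.

```python
import math


def sd_mean(sd_dist: float, n: int) -> float:
    """Unreliability of a mean: S.D. distribution / sqrt(N)."""
    return sd_dist / math.sqrt(n)


def sd_median(sd_dist: float, n: int) -> float:
    """Unreliability of a median: 1.25 * S.D. distribution / sqrt(N)."""
    return 1.25 * sd_dist / math.sqrt(n)


def sd_q(sd_dist: float, n: int) -> float:
    """Unreliability of a Q: 1.11 * S.D. distribution / sqrt(2N)."""
    return 1.11 * sd_dist / math.sqrt(2 * n)


def sd_sd(sd_dist: float, n: int) -> float:
    """Unreliability of an S.D.: S.D. distribution / sqrt(2N)."""
    return sd_dist / math.sqrt(2 * n)


def practical_certainty(measure: float, unreliability: float, k: float = 3.0):
    """Limits of measure +/- k times its unreliability (k = 3 in the text)."""
    return measure - k * unreliability, measure + k * unreliability


# Table 46: S.D. distribution = 2.22, N = 24.
for name, fn in (("mean", sd_mean), ("median", sd_median),
                 ("Q", sd_q), ("S.D.", sd_sd)):
    print(name, round(fn(2.22, 24), 2))
print(practical_certainty(7.0, 0.45))
```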
Unreliability of a Difference. — This is one of the most 




useful of all the unreliability measures, especially for ex- 
perimental work where there is an experimental and control 
group. The difference between the means, medians, or other
measures for these two groups determines the conclusion 
from the experiment. 

The unreliability of this difference determines the value of 
the conclusion. The following abbreviated formula makes 
most conclusions slightly conservative. 



$$\text{S.D. difference} = \sqrt{(\text{S.D. measure I})^2 + (\text{S.D. measure II})^2}$$

Suppose that Experimental Group I of 25 pupils were 
taught by a new method and showed a mean improvement 
of 18 with an S.D. distribution of improvements of 4, while 
the Control Group II of 36 pupils showed a mean improve- 
ment of 16 with an S.D. distribution of improvements of 3. 
The difference in mean improvements, 18 — 16, is 2. Is this 
difference so reliable that we can be practically certain that 
if the experiment were repeated upon similar groups, the 
difference would not become zero or actually favor Group 
II? Before we can compute the unreliability of a difference 
it is necessary to compute the unreliability of the measures 
with whose difference we are concerned. The total process 
is shown below. 

$$\text{S.D. mean I} = \frac{4}{\sqrt{25}} = .8 \qquad \text{S.D. mean II} = \frac{3}{\sqrt{36}} = .5$$

$$\text{S.D. difference} = \sqrt{(.8)^2 + (.5)^2} = .94+$$

We can be practically sure that the true difference between
the two improvement means will be between the
obtained difference 2 ± 3 (.94), i.e., between − .82 and
4.82. Evidently there is some chance that the true difference
is zero or even below zero. For the true difference to go 
below zero would make the experiment favor Group II. The 
chances are, however, much greater that the true difference 
is above zero and favorable for Group I and the new teach- 
ing method. The way to determine just how much greater 
the chances are is shown later. 
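
The whole process for the two groups can be sketched in Python. The variable and function names are illustrative. The limits come out a shade wider than − .82 and 4.82 because the sketch carries the unrounded S.D. difference rather than .94.

```python
import math


def sd_difference(sd_measure_1: float, sd_measure_2: float) -> float:
    """Unreliability of a difference between two measures."""
    return math.sqrt(sd_measure_1 ** 2 + sd_measure_2 ** 2)


sd_mean_1 = 4 / math.sqrt(25)     # Experimental Group I: S.D. 4, 25 pupils
sd_mean_2 = 3 / math.sqrt(36)     # Control Group II: S.D. 3, 36 pupils
sd_diff = sd_difference(sd_mean_1, sd_mean_2)

difference = 18 - 16              # difference in mean improvement
low = difference - 3 * sd_diff    # lower limit of practical certainty
high = difference + 3 * sd_diff   # upper limit of practical certainty
print(round(sd_diff, 2), round(low, 2), round(high, 2))
```

Because the lower limit falls below zero, the sketch reaches the same conclusion as the text: a true difference of zero, or one favoring Group II, cannot be ruled out.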




It is possible to compute the unreliability of the difference 
between any two measures. We are far more often con- 
cerned with differences between means, but the formula for 
S.D. difference is applicable to the difference between 
medians, Q's, S.D.'s, r's, etc. Just as the formula for the
unreliability of a mean was used in order to compute S.D.
mean I and S.D. mean II, preliminary to substituting these
measures in the formula for S.D. difference, so it is necessary
to compute S.D. r I and S.D. r II or S.D. Q I and
S.D. Q II according to the formula for the unreliability of
an r and of a Q respectively, before substituting in the
formula for the unreliability of a difference.

Experimental Coefficients. — Because the unreliability 
of a difference is so extensively used in experimentation, and 
because it is difficult for some people to think in terms of 
chances, I have devised what may be called an experimental 
coefficient. This experimental coefficient is easily computed 
and automatically shows in a relatively non-technical 
fashion just how unreliable any difference is. We dis- 
covered in computing S.D. difference above that the un- 
reliability of the obtained difference of 2 is .94. The 
experimental coefficient is represented by the following for- 
mula: 

$$\text{Experimental Coefficient} = \frac{\text{Difference}}{2.78\ \text{S.D. difference}}$$

Since the difference is 2 and the S.D. difference is .94,

$$\text{Experimental Coefficient} = \frac{2}{2.78 \times .94} = .76$$

The experimental difference of 2 is only .76 as large as it
needs to be in order that we may be practically certain that
the new method of teaching is truly better than the method
used with the control group. Had the difference been 2.61
instead of 2 the experimental coefficient would have been as
follows:

$$\text{Experimental Coefficient} = \frac{2.61}{2.78 \times .94} = 1.0$$




An experimental coefficient of 1.0 is just exactly practical
certainty. An experimental coefficient of .5 means half certainty,
one of 2.0 means double certainty, and so on.
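
The experimental coefficient is a single division, sketched here in Python with an illustrative function name. Rounding the S.D. difference to .94 before dividing, as the text does, can shift the last decimal of the result slightly.

```python
def experimental_coefficient(difference: float, sd_difference: float) -> float:
    """Obtained difference divided by 2.78 times its unreliability."""
    return difference / (2.78 * sd_difference)


# The two cases worked in the text, with S.D. difference = .94.
print(experimental_coefficient(2.0, 0.94))
print(experimental_coefficient(2.61, 0.94))
```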

If a statement is desired, in terms of chances, of the proba- 
bility that the true difference is a zero difference or less, i. e., 
actually favors a conclusion opposite to the one obtained, 
such a statement in terms of chances, once the experimental 
coefficient has been computed, may be read directly from 
Table 47. This table applies to a difference between any 
two obtained measures as truly as it applies to a difference 
between obtained means. 

TABLE 47

Shows How to Convert an Experimental Coefficient into a
Statement of Chances

   Experimental Coefficient      Approximate Chances
             .1                      1.6 to 1
             .2                      2.5 to 1
             .3                      3.9 to 1
             .4                      6.5 to 1
             .5                       11 to 1
             .6                       20 to 1
             .7                       38 to 1
             .8                       75 to 1
             .9                      160 to 1
            1.0                      369 to 1
            1.1                      930 to 1
            1.2                     2350 to 1
            1.3                     6700 to 1
            1.4                    20000 to 1
            1.5                    67000 to 1
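
Table 47 does not state how its chances were computed. The lower entries are approximately reproduced if 2.78 times the experimental coefficient is treated as a deviate of the normal curve and the chances are taken as the area below that deviate divided by the area above it. That reading is an assumption, not the author's stated method, and the function names below are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Area of the normal curve below the deviate z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chances_from_coefficient(ec: float) -> float:
    """Approximate chances (to 1) that the true difference is above zero."""
    p = normal_cdf(2.78 * ec)      # assumed: 2.78 * EC is a normal deviate
    return p / (1.0 - p)


for ec in (0.1, 0.5, 1.0):
    print(ec, chances_from_coefficient(ec))
```

The results are close to, but not identical with, the tabled figures, as is to be expected of a table labeled approximate.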



How to Transmute S.D. Measure into P.E. Measure.

It has already been noted that in a normal frequency distribution,
P.E., M.D., and Q are equal and that P.E. equals
.6745 S.D. It is somewhat more conventional to express
unreliability in terms of P.E. than in terms of S.D.
P.E. measure is found by multiplying S.D. measure by
.6745. One of the above formulae is repeated to illustrate
this transmutation. It is similar for other formulae.

$$\text{S.D. mean} = \frac{\text{S.D. distribution}}{\sqrt{N}}$$

$$\text{P.E. mean} = \frac{.6745\ \text{S.D. distribution}}{\sqrt{N}}$$

How to Interpret Unreliability. — Illustrations have 
already shown how to interpret practical certainty, which
means that the chances are roughly 369 to 1 that the true
measure is between the obtained measure ± 3 S.D. measure.

Other approximate chances are given below:

The chances are 2.15 to 1 the true measure is between
obtained measure ± 1 S.D. measure

The chances are 21 to 1 the true measure is between
obtained measure ± 2 S.D. measure

The chances are 369 to 1 (practical certainty) the true
measure is between obtained measure ± 3 S.D. measure

The chances are 1 to 1 the true measure is between
obtained measure ± 1 P.E. measure

The chances are 4.6 to 1 the true measure is between
obtained measure ± 2 P.E. measure

The chances are 22 to 1 the true measure is between
obtained measure ± 3 P.E. measure

The chances are 142 to 1 the true measure is between
obtained measure ± 4 P.E. measure

The chances are 369 to 1 (practical certainty) the true
measure is between obtained measure ± 4.4 P.E. measure
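
The chances listed above can be approximated from the normal curve. This sketch assumes the chances are the area inside the limits divided by the area outside them, and it uses P.E. = .6745 S.D. as stated earlier; the function names are illustrative.

```python
import math

PE_PER_SD = 0.6745   # one P.E. expressed in S.D. units


def chances_within_sd(k: float) -> float:
    """Chances (to 1) that the true measure lies within +/- k S.D. measure."""
    inside = math.erf(k / math.sqrt(2.0))   # normal area between -k and +k
    return inside / (1.0 - inside)


def chances_within_pe(k: float) -> float:
    """Chances (to 1) that the true measure lies within +/- k P.E. measure."""
    return chances_within_sd(k * PE_PER_SD)


for k in (1, 2, 3):
    print(f"+/- {k} S.D.: {chances_within_sd(k):.2f} to 1")
for k in (1, 2, 3, 4):
    print(f"+/- {k} P.E.: {chances_within_pe(k):.2f} to 1")
```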




Summary for Statistical Methods.. — Two problems 
have been used for most of the illustrations in order to re- 
veal the continuous and cumulative nature of statistical 
processes. To return to the simplicity of childhood, these 
processes are as continuous and interdependent as "The
House That Jack Built." For example, this is a frequency 
surface, that yields a distribution, that yields a central ten- 
dency, that yields a variability, that makes a correlation, 
etc. As further evidence of this continuity and as a test of 
the student's mastery of the technique described the follow- 
ing problems have been solved. It is suggested that the 
student verify the answers. 

TABLE 48

Sample Computation of Some Common Statistical Measures.
(Approximate Answers.)

                  Freq.   Freq.   Freq.   Freq.               Freq.   Freq.   Freq.   Freq.
  Score             I      II      III     IV      Score        I      II      III     IV
   0-2              3       1      25       1       0-10        1       3       1      50
   2-4              4       1      25       1      10-20        1       4       1      50
   4-6              4       2      50       1      20-30        2       4       1     100
   6-8              5       2     100       0      30-40        2       5       0     200
   8-10             5       4     300       1      40-50        4       5       1     600
  10-12             6       5     200       4      50-60        5       6       4     400
  12-14             4       4     100       2      60-70        4       4       2     200
  14-16             4       3      50       0      70-80        3       4       0     100
  16-18             3       2      25       0      80-90        2       3       0      50
  18-20             2       1      25       2      90-100       1       2       2      50

  Mode            11.00   11.00    9.00   11.00               55.00   55.00   55.00   45.00
  Mean             9.55   10.76    9.89   10.50               53.80   47.75   52.50   47.45
  Median           9.60   11.00    9.67   11.00               55.00   48.00   55.00   48.35
  Q1               5.50    8.13    8.17    7.00               40.65   27.50   35.00   40.85
  Q3              13.50   13.88   11.75   13.00               69.40   67.50   65.00   58.75
  Q                4.00    2.88    1.79    3.00               14.40   20.00   15.00    8.95
  S.D. from Mean   5.08    4.39    3.54    5.30               21.95   25.40   26.50    1.77
  S.D. mean         .80     .88     .12    1.53                4.40    4.00    7.65     .04
  P.E.              .43     .46     .06     .81                2.30    2.15    4.05     .02



SUPPLEMENTARY READING FOR PART III 



Alexander, Carter. School Statistics and Publicity;
Silver, Burdett & Company, New York, 1919.

Brinton, Willard C. Graphic Methods for Presenting
Facts; The Engineering Magazine Company, New
York, 1917.

King, W. I. Elements of Statistical Methods; The Macmillan
Company, New York, 1912.

Routzahn, E. G., and Mary S. The A B C of Exhibit
Planning; Russell Sage Foundation, New York, 1918.

Rugg, Harold O. Application of Statistical Methods to
Education; Houghton Mifflin Company, New York,
1916.

Scott, Walter Dill. The Psychology of Advertising;
Small, Maynard & Company, Boston, 1910.

Thorndike, Edward L. An Introduction to the Theory of
Mental and Social Measurements; Teachers College,
Columbia University, New York, 1913.



APPENDIX 

HOW TO SECURE TESTS 
AND DIRECTIONS FOR THEIR USE 

This book has attempted to give the fundamental pro- 
cedure for any type of mental measurement. But it has 
been impossible in a book of this size and inappropriate 
in a book of this character to describe at length the existing 
tests and scales. Tests are changing at a phenomenal rate 
and changing for the better. It is the function of frequent 
bulletins issued by book companies and bureaus of research 
to inform educators of the latest and best tests. All the 
important centers which distribute testing material are pre- 
pared to send free or practically free literature describing 
their tests. More than this they are glad to give expert 
advice as to the test or tests which it is best to use in a par- 
ticular situation. Finally, they are prepared, for a small 
charge, to send for inspection sample tests. Again, the 
bureaus which issue tests usually do and always should send 
with the tests which have been ordered a leaflet giving 
detailed directions for applying and scoring the tests, for 
tabulating results, and for computing pupil and class scores. 
The directions usually include norms for the test and fre- 
quently suggestions for the uses of results. As a precau- 
tion the individual, when writing for tests, should request 
that all necessary directions for properly using them be sent. 

The following are the chief centers for the distribution 
of tests: 

Bureau of Publications, Teachers College, New York City.

Public School Publishing Co., Bloomington, Ill.

World Book Company, Yonkers-on-Hudson, N. Y.

C. H. Stoelting Company, Chicago, Ill.


The World Book Company has just issued a booklet en- 
titled Bibliography of Tests for Use in Schools, which sells 
for ten cents. This booklet gives tests sold by other agencies 
than themselves. To date this is probably the most complete 
list of tests ever assembled. The other centers mentioned 
above also have descriptive booklets for the tests which they 
distribute. 

The following references also contain elaborate lists of 
tests or bibliographies on tests or both: 
Holmes, Henry W., and Others. A Descriptive Bibliography
of Measurement in Elementary Subjects; Harvard
University Press, Cambridge, Mass., 1917.
National Society for the Study of Education. Seventeenth
Year Book, Part II; Public School Publishing
Company, Bloomington, Ill., 1918.
Ruger, Georgie J. Bibliography on Psychological Tests;
Bureau of Educational Experiments, New York, 1918.
Whipple, Guy M. Manual of Mental and Physical Tests,
Vols. I and II; Warwick & York, Baltimore, 1910.



INDEX 



Abilities, relative importance of, 146, 

147, 148. 

Abstract intelligence, 173, 174. 

Accomplishment Quotient, in read- 
ingVSs, 86, 87, 149, 150; and effi- 
ciency, 150-156. 

Adams, on problem solving, 100, 101.

Adaptation, of instructions, 241, 242, 243.

Age scale, construction of, 256; in- 
terpretation of, 256, 257, 258; 
evaluation of, 291-307; zero point 
and unit for, 295, 297, 298; 
norms for, 315, 316. 

Analogy, test, 197. 

Analysis of test results, diagnosis 
by, 97-102. 

Aptitude, in educational guidance, 
83, 84, 85; in vocational guid- 
ance, 177-183. 

Arithmetic, diagnosis and treatment 
of, 91-98, 100; types of examples 
in, 202, 203 ; how pupils reason 
in, 100, 101.

Ashbaugh, spelling scale, 203. 

Average, discussion of, 366-378. 

Aviation, test for, 228, 229. 

Ayres, on Rice and Thorndike, 14; 
16; on entering age and subse- 
quent progress, 33 ; spelling scale, 
203; on product scales, 263, 264, 
265 ; on graphic methods, 347. 

Ballou, on type examples, 202. 

Bases, of classification, 19, 20. 

Bibliography, for tests, 411. 

Binet-Simon, intelligence scale, 78, 
82, 258, 295. 

Bingham, on doing vs. telling, 183. 

Bobbitt, on tabular methods, 329. 

Bridges, on interest and ability, 185. 

Brinton, on graphic methods, 341. 

Buckingham, 15; Illinois examina- 
tion, 258; zero point, 294. 



Calibrator, use of in scaling, 287, 
288, 289. 

Capacity, to learn, 80-86; see Ac- 
complishment Quotient or Intelli- 
gence Quotient. 

Carney, C. S., on special disabili- 
ties, 181. 

Cattell-Fullerton, theorem of, 15; 
validity of theorem of, 267, 268. 

Central tendency, see Average. 

Certain, C. C, on project testing, 
247. 

Chance, in True-False test, 121, 122, 
123; causes of normal curve, 357. 

Chances, statement of, 405, 406. 

Chapman, 204. 

Character, traits of gifted pupils, 65, 

Chassell, Clara, on professors' marks, 
396. 

Clark, on vocational placement, 
169. 

Classification, bases of, 19, 20; by 
mental age, 21, 22, 23; by educa- 
tional tests, 23-58; by teacher's 
judgment, 58-63 ; by promotion 
age, 61, 62; objections to, 62-66; 
tests for, 24; accuracy of, 22, 42- 
46, 51, 52; table for, 45-48; rules 
for, 48 ; illustration of, 49, 50 ; suc- 
cess of, 52-56; procedure for in 
large school, 55, 56, 57. 

Coaching, avoidance of, 234, 235, 311.

Committee, on Graphic Presenta- 
tion, 332-342. 

Composite, computation of, 25-37. 

Comprehension, visual and memory, 136, 137.

Comprehensiveness, in test, 201, 202, 203.

Consensus, of associates, 175, 176, 177.

Contrast of opposites, diagnosis by, 101, 102.





Correlation, with test criterion, 195- 
227; partial coefficients of, 216- 
221; and test reliability, 310; in- 
terpretation and uses, 388, 389, 
393; computation of, 389-396; 
and prediction, 393-398; reliabil- 
ity of, 402. 

Correspondence, 203, 204; see Rela- 
tionship. 

Courtis, 15; on diagnosis of arithmetical
defects, 90-96; practice
tests, 112, 113, 114; on efficiency 
of Gary schools, 165, 166; speed 
and accuracy conversion formula, 
252; goal scale, 252; on Cattell- 
Fullerton theorem, 267, 268. 

Coy, on ambitions of gifted pupils, 
188. 

Crathorne, on stability of interest, 
185, 186. 

Criterion, of validity, 195, 204, 208, 
209; of intelligence, 210, 211. 

Cumulative total, 300-307. 

Curve, see Diagrams.

Curvilinearity, see Rectilinearity. 

Davis, on vocational guidance, 169.

Dearborn, intelligence tests, 79.

Demotion, see Classification.

Developmental history, diagnosis by, 101.

Diagrams, construction of, 332-354; 
types of, 344-349; selection of, 
348-352; preparation of, 350, 351; 
reproduction of, 352, 353. 

Diagnosis, of initial ability, 67-88; 
functions of, 77, 78, 88, 89; of 
general and specialized capacity, 
79-86; methods of, 89-112; prerequisites
of skill in, 109, 110, 111.

Dickson, on relation of intelligence 
to school work, 21. 

Directions, for tests, 69-77, 135, 139, 
235-249, 410. 

Dollinger, on interest and ability, 
185. 

Duplicate tests, construction of, 305, 
306; and self-correlation, 309, 310, 
311. 

Educational age, computation of, 36, 
37, 38, 257; vs. mental age, 38, 
39, 40. 

Educational Quotient, computation 
of, 36, 37, 38, 257; vs. Intelli- 
gence Quotient, 40, 41, 42; in 
classification, 46-52. 



Efficiency, of study and instruction, 
150-169. 

Eliot, and life career motive, 169. 

Emphasis, regulation of, 17, 18, 144- 
148; distribution of, 166. 

Empirical, test, 198. 

Examination, illustration, construc-
tion, application, scoring, and ad-
vantages of True-False, 119-134;
scaling of, 289, 290, 291.

Examiners, directions for, 248. 

Exercise, from practice tests, 117,
118.

Experimental coefficient, computa- 
tion and interpretation of, 404, 
405. 

Foote, on tests, 8; on objectives, 78, 

167; on norms, 316. 
Footrule, for correlation, 391, 392, 

393. 

Form, of test, 227-236. 

Franzen, on Accomplishment Quo- 
tient, 39, 40; on classification, 55; 
on gifted pupils, 64; on efficiency 
measurement, 153. 

Frequency distribution, construction
of, 359-364.

Frequency surface, overlapping of, 
42-46; construction and interpre- 
tation of, 354-360. 

Fullerton-Cattell, theorem of, 15; 
validity of theorem of, 267-268. 

Gifted pupils, see Classification and
Accomplishment Quotient and In-
telligence Quotient; vocational
guidance of, 187, 188, 189.

Goal, see Objective. 

Grade, norms, 32-37, 285; adjust- 
ment of norms for, 25. 

Grade scale, construction of, 258- 
263 ; evaluation of, 291-307. 

Grade unit, computation of, 157-
164.

Grading, see Classification. 

Graphs, see Diagrams. 

Gray, W. S., oral reading test, 138. 

139, 140. 
Greene, organization test, 231. 
Group, vs. individual testing, 233. 

234, 235. 

Haggerty, intelligence test, 79; on
combining units, 300-306.
Health, of gifted pupils, 64, 65. 






Henmon, tests for aviators, 197; on 
method of combining units, 300- 
306. 

Hillegas, 15; on product scale tech-
nique, 265, 266, 267.

Histogram, 351. 

Hollingworth, H. L., on character 
and vocational guidance, 174; on 
consensus of associates and self- 
analysis, 175, 176, 177, 178; on 
test types, 197, 198. 

Hollingworth, Leta, on diagnosis of
spelling, 102-110.

Horoscope, for vocational guidance, 
177, 178. 

Individuality, in instruction, 114,
115, 116.

Individual testing, see Group. 

Initiative, effect of tests upon, 17, 
18, 144-148. 

Instructions, principles for con- 
structing, 235-249; brevity and 
adequacy of, 235, 236, 237; with 
demonstration and preliminary 
test, 237-242; adaptation and uni- 
formity of, 241, 242, 243; for in- 
telligence tests vs. educational 
tests, 241, 242; order of, 243, 244, 
245; and action units, 245, 246; 
and interest, 246, 247, 248; for 
examiners, 248. 

Intelligence, and diagnosis, 111; an-
alysis of, 211, 212, 213; measure-
ment of, 213-227.

Intelligence Quotient, vs. Educa-
tional Quotient, 40, 41, 42; in
classification, 57; interpretation 
of, 79-86. 

Interest, absence of, 110; stimula-
tion of, 116, 117, 133-150; in
vocational guidance, 184-185; sta- 
bility of, 185, 186; and intelli- 
gence tests, 220, 221. 

Introspection, diagnosis by, 89. 

Irrelevancy, and validity, 198-202. 

Jones, frequency of occurrence 

scale, 252. 
Jordan, Arthur, on norms, 316. 
Judd, on classification, 20, 21; on 

graphic methods, 349. 

Kelley, T. L., on age-grade inter- 
val, 34; on correction for over- 
lapping, 45; on combining units, 
302; on error of prediction, 394. 



Kirby, on practice tests, 116. 
Kruse, on overlapping, 43, 44, 204. 

MacKnight, on ambitions of gifted 

pupils, 188. 
Marks, for pupils, 57-63, 154, 155.
Mass measures, discussion of, 354- 

365. 

Maturity, and efficiency, 165; and 
intelligence measurement, 221. 

McCall, on correlation of mental 
and educational tests, 21; reading 
scale, 68. 

McComas, telephone test, 197. 

McMurry, F. M., on goals, 11. 

Mean, computation of, 366-371; re- 
liability of, 400, 401. 

Mean deviation, computation of, 
380, 381, 382. 

Measurement, scope of, 10, 11, 12; 
ancillary, 12, 13; evolution of, 14, 
15, 16; and mechanization, 17, 18. 

Mechanical, tests, 17, 18, 145-148; 
tabulation, 323. 

Mechanical intelligence, 173, 174. 

Median, computation of, 370-377; 
reliability of, 401, 402. 

Meine, 204. 

Memory comprehension, see Com- 
prehension. 

Mental age, relation of to school 
work and grade position, 21, 22; 
vs. educational age, 38, 39, 40; 
scale, 78, 315, 316; estimation of, 
141, 142. 

Midscore, computation of, 376, 377. 

Miniature, test, 197. 

Mode, computation of, 365. 

Monroe, W. S., on type principle 
of selection, 202, 203; Illinois ex- 
amination, 258; on method of 
combining units, 300-306. 

Morton, on retardation, 23. 

Multiplier, use of, 31, 32. 

Neural connections, and intelligence, 

211, 212, 213. 
Non-verbal, tests, 78, 238. 
Norms, date adjustment of, 25;
grade into age, 32-37; age, 280-
286; grade, 285; criteria for, 313-
318.

Objective, location of, 140-148. 

Objectivity, in scoring, 227-234; im- 
portance of, 311, 312; measure- 
ment of, 312, 313. 






Observation, diagnosis by, 89, 90, 91. 
Optimum interval, 309, 310.
Oral, trade test, 205-210. 
Oral tracing, diagnosis by, 95, 96, 

97. 
Order distribution, 364. 
Organization, of neural connections, 

212; of test material, 227-236. 
Otis, intelligence test, 231. 
Overlapping, causes of, 42, 43, 44, 45. 

Palmistry, for vocational guidance, 
177, 178. 

Parsons, on vocational guidance, 169. 

Paterson, performance scale, 78; on 
norms, 315, 316. 

P. E., in grade scale, 258-264; con- 
stancy of, 262, 263; in product 
scale, 265-272; validity of, 267- 
272; computation of, 379, 405,406. 

Pedagogical age, significance of, 57- 
60; computation of, 60, 61. 

Percentiles, computation of, 253, 254,
370-377.
Percentile scale, construction of, 253, 

254; interpretation of, 254, 255, 

256; evaluation of, 291-307. 
Pantomime, test, 238. 
Performance scale, see Percentile 

scale, and Age scale, and Grade 

scale, and Pintner. 
Personnel, 183, 184. 
Phrenology, for vocational guidance, 

177, 178. 
Physical, vs. mental measurement, 5,
6, 7; defects and diagnosis, 110,
111.
Physiognomy, for vocational guid- 
ance, 177, 178. 
Pintner, performance scale, 78; on 

percentile indices, 255, 256; on 

scaling total scores, 305; on 

norms, 315, 316. 
Placement, of new pupils, 57. 
Point measures, discussion of, 

365-378. 
Practice tests, description of, 112, 

113, 114; value of, 114-119. 
Preliminary test, 237-242. 
Pressey, Primer Scale, 79; Mental 

Survey, 230. 
Product-Moment, correlation, 388, 

389, 390. 
Product scale, construction of, 263- 

268; peculiarity of, 270, 271; 

evaluation of, 291-307; transmu- 



tation of, 299, 300; method of 

combining units for, 300-306. 
Promotion, see Classification. 
Promotion age, computation of, 61, 

62. 
Prophecy, technique of, 216-221; 

formula for and error of, 394, 397. 
Purposes, importance of, 146, 147, 

148. 

Q1, computation of, 370-377.

Q3, computation of, 370-377. 

Quantitative, vs. qualitative, 3, 4, 
17, 18, 144-148. 

Quartile deviation, in weighting, 30, 
31; computation of, 379, 380; re- 
liability of, 402. 

R, see Footrule. 

r, see Product-moment. 

Random-sampling, and validity, 201, 

202. 
Rank distribution, 364. 
Reading, initial ability in, 67-79;
analysis of, 97, 98, 99; informal
tests of, 134-141; objectives in,
140-145.
Reading age, computation of, 73; as 

an objective, 140, 141, 142. 
Reading Quotient, computation of, 

75. 

Reclassification, see Classification. 

Rectilinearity, of relation line, 216- 
221, 394, 395. 

Relationship measures, discussion 
of, 388-399. 

Reference point, 291-296. 

Regression, technique of, 216-221.

Reliability, sources of, 307, 308, 309;
measurements of, 309, 310; meth-
od of increasing, 310, 311; of col-
lege marks, 396; computation
and interpretation of, 398-408;
of mean, 400, 401; of median,
402; of Q, 402; of S.D., 401; of
r, 402; of a difference, 402-406.

Retardation, amount of, 23. 

Rice, place in measurement, 14, 15.
Richards, on the quantitative, 8. 
Robinson, 204. 

Rogers, on prognostic tests, 84. 
Ruger, proverbs test, 230. 
Rugg, H. O., on tabulation methods, 

323 ; on step intervals, 360. 
Ruml, 204. 






Sampling, test, 197. 

Scale, percentile, 253-257; reasons 
for, 249-253; goal, 252; fre- 
quency-of-occurrence, 252 ; age, 
256, 257, 258; grade, 258-264; 
product, 263-272; T, 272-307. 

Science, prerequisites of, 7, 8, 9. 

Scott Company, 183. 

Scoring, economy in, 227-234; me- 
chanical devices for, 231, 232, 

233. 
Seashore, musical tests, 180. 
Self-analysis, method of, 175, 176,
177.
Self-correlation, see Correlation. 
Sigma, see Standard deviation. 
Simple total, 300-307. 
Simpson, on norms, 314, 315. 
Social age, relation of to mental 

age, 65, 66. 
Social intelligence, 173, 174. 
Social-worth, principle of selection, 

203. 
Spearman, Footrule formula, 390; on
self-correlation prophecy, 396.
Speed, of silent reading, 136, 137; 

of oral reading, 138, 139; trans- 
muted into accuracy, 252. 
Spelling, analysis of, 102-109. 
Standard deviation, in T scale, 272- 

307; computation of, 383-388; 

reliability of, 402. 
Standards, see Objective. 
Step interval, determination of, 360, 

361; limits of, 361, 362, 363. 
Stone, C. W., 15. 
Stratton, tests for aviators, 197. 
Strayer, on retardation, 23. 
Subject age, 257. 
Subjectivity, see Objectivity. 
Subnormal pupils, see Gifted pupils; 

vocational guidance of, 189. 

T scale, construction of, 272-292; 
norms for, 280-286; extension of, 

285, 286; increase of accuracy for, 

286, 287; short cuts for, 289, 290, 
291 ; comparative evaluation of, 
291-307; reference point for, 291- 
296; unit for, 295-301; method of 
combining units for, 300-307. 

T score, in reading, 72-75; as an ob-
jective, 142; as a scale unit, 272-
307.
Tables, construction and placement 

of, 325-331. 



Tabulation, types of, 321, 322, 323; 
selection of form for, 324, 325. 

Teachers' judgment, see Pedagogi- 
cal age. 

Terman, on mental age and school 
work, 21, 22; on mental-age grade 
intervals, 34; on I.Q. distribu- 
tion, 41 ; on teachers' estimate of 
pupils, 58, 59; on health of gifted 
children, 64; on social develop- 
ment of gifted pupils, 65 ; on I.Q., 
79; on intelligence limits for vo- 
cations, 171; on linguistic irrele- 
vancies, 199; on methods of meas- 
uring intelligence, 222-227; as a 
scale constructor, 288, 289. 

Tests, purchase of, 410. 

Thorndike, on educational measure-
ments, 6, 7; place in measure-
ments, 14, 17; reading scale, 68;
college entrance tests, 82 ; on 
analysis of reading, 97, 98, 99; 
tests for clerical workers, 180; on 
present methods of vocational 
placement, 184; on interest and 
ability, 185; on weighting, 215- 
221; on aviation test, 228, 229, 
230; pantomime test, 238; vs. 
Ayres' handwriting scale, 263, 264,
265; as a scale constructor, 296, 
297, 298; on combining units, 300- 
306; on causes of normal curves, 

357. 

Toops, 204.

Total range, computation of, 379, 
380. 

Trabue, 15, 17; zero point, 294; on 
combining units, 300-306. 

Trade-ability, in vocational guid- 
ance, 182, 183, 184. 

Transfer, of training, 166, 221. 

Transients, 276. 

True-False, see Examination. 

Type, principle of selection, 202, 
203. 

Uhl, on diagnosis in arithmetic, 96,
97.
Undistributed, scores, 240, 250, 

251. 

Uniformity, in results, 17, 18; in
instructions, 241, 242, 243.

Unit, percentile, 253-257; growth,
256, 257, 258; grade variability,
258-264; variability-of-adult-per-
formance, 263, 264, 265; variabil-
ity-of-judgment, 265-272; evalua-
tion of, 295-301.
Unreliability, see Reliability. 

Validation, of tests, 195-227; of 

trade test, 205-210. 
Variability, and weighting, 30, 31, 

32. 

Variability measures, discussion of, 
378-388. 

Visual comprehension, see Compre- 
hension. 

Vocational guidance, functions of, 
169, 170; intelligence limits in, 
170-175; moral and physical lim- 
its in, 174-178; for gifted and un- 
gifted pupils, 187, 188, 189. 



War Department, bulletin, 171, 

172. 
Weber's law, 268. 
Weights, how to effect, 31, 32; for 

subordinate traits, 216-221. 
Wiley, 204. 

Williams, on norms, 167, 316. 
Woodworth-Wells, directions tests,
101.

Woody, arithmetic scales, 202, 203; 
on grade scale technique, 258- 
263; zero point, 294; on combin- 
ing units, 300-306. 

Yerkes, on army mental tests, 16. 

Zero point, 291-296.



